L Number | Hits | Search Text | DB | Time stamp
1 | 2204 | candidate with pair | USPAT; US-PGPUB; EPO; JPO; DERWENT; IBM_TDB | 2004/05/25 08:38
2 | 77105 | equivalent with descri$7 | USPAT; US-PGPUB; EPO; JPO; DERWENT; IBM_TDB | 2004/05/25 08:39
3 | 69 | (candidate with pair) and (equivalent with descri$7) | USPAT; US-PGPUB; EPO; JPO; DERWENT; IBM_TDB | 2004/05/25 08:42
4 | 4 | ((candidate with pair) and (equivalent with descri$7)) and queries and score and threshold | USPAT; US-PGPUB; EPO; JPO; DERWENT; IBM_TDB | 2004/05/25 08:54
5 | 38 | (candidate with pair) and (queries with descri$6) | USPAT; US-PGPUB; EPO; JPO; DERWENT; IBM_TDB | 2004/05/25 08:57
6 | 12 | ((candidate with pair) and (queries with descri$6)) and score and threshold | USPAT; US-PGPUB; EPO; JPO; DERWENT; IBM_TDB | 2004/05/25 08:49
7 | 5 | @pd<20020201 and (((candidate with pair) and (queries with descri$6)) and score and threshold) | USPAT; US-PGPUB; EPO; JPO; DERWENT; IBM_TDB | 2004/05/25 08:54
8 | 8 | @pd<20020201 and ((candidate with pair) and (queries with descri$6)) | USPAT; US-PGPUB; EPO; JPO; DERWENT; IBM_TDB | 2004/05/25 08:54
9 | 82 | (candidate with pair) and queries and score and threshold | USPAT; US-PGPUB; EPO; JPO; DERWENT; IBM_TDB | 2004/05/25 08:54
10 | 24 | @pd<20020201 and ((candidate with pair) and queries and score and threshold) | USPAT; US-PGPUB; EPO; JPO; DERWENT; IBM_TDB | 2004/05/25 09:10
11 | 1 | (@pd<20020201 and ((candidate with pair) and queries and score and threshold)) and score and collection and synonym | USPAT; US-PGPUB; EPO; JPO; DERWENT; IBM_TDB | 2004/05/25 08:58
12 | 248 | (candidate with pair) and (queries and descri$6) | USPAT; US-PGPUB; EPO; JPO; DERWENT; IBM_TDB | 2004/05/25 08:57
13 | 26 | (candidate with pair) and (queries and (descri$6 with candidate)) | USPAT; US-PGPUB; EPO; JPO; DERWENT; IBM_TDB | 2004/05/25 08:57
14 | 3 | ((candidate with pair) and (queries and (descri$6 with candidate))) and score and collection and synonym | USPAT; US-PGPUB; EPO; JPO; DERWENT; IBM_TDB | 2004/05/25 09:07
15 | 3045 | queries with terms | USPAT; US-PGPUB; EPO; JPO; DERWENT; IBM_TDB | 2004/05/25 09:08
16 | 727 | (queries with terms) and candidate | USPAT; US-PGPUB; EPO; JPO; DERWENT; IBM_TDB | 2004/05/25 09:08
17 | 103 | ((queries with terms) and candidate) and collection and threshold and users and calculat$ and (frequen$ with occur$) | USPAT; US-PGPUB; EPO; JPO; DERWENT; IBM_TDB | 2004/05/25 09:09
18 | 23 | @pd<20020201 and (((queries with terms) and candidate) and collection and threshold and users and calculat$ and (frequen$ with occur$)) | USPAT; US-PGPUB; EPO; JPO; DERWENT; IBM_TDB | 2004/05/25 09:10
20 | 1 | ((@pd<20020201 and (((queries with terms) and candidate) and collection and threshold and users and calculat$ and (frequen$ with occur$))) and synonym) and (misspell$ or (spell with check$)) | USPAT; US-PGPUB; EPO; JPO; DERWENT; IBM_TDB | 2004/05/25 09:10
19 | 8 | (@pd<20020201 and (((queries with terms) and candidate) and collection and threshold and users and calculat$ and (frequen$ with occur$))) and synonym | USPAT; US-PGPUB; EPO; JPO; DERWENT; IBM_TDB | 2004/05/25 09:16
21 | 1 | (((@pd<20020201 and (((queries with terms) and candidate) and collection and threshold and users and calculat$ and (frequen$ with occur$))) and synonym) and (misspell$ or (spell with check$))) and pair | USPAT; US-PGPUB; EPO; JPO; DERWENT; IBM_TDB | 2004/05/25 09:13
22 | 0 | (707/3).ccls. and (@pd<20020201 and ((candidate with pair) and (queries with descri$6))) | USPAT; US-PGPUB; EPO; JPO; DERWENT; IBM_TDB | 2004/05/25 09:17
23 | 1 | (707/5).ccls. and (@pd<20020201 and ((candidate with pair) and (queries with descri$6))) | USPAT; US-PGPUB; EPO; JPO; DERWENT; IBM_TDB | 2004/05/25 09:17
24 | 0 | (707/4).ccls. and (@pd<20020201 and ((candidate with pair) and (queries with descri$6))) | USPAT; US-PGPUB; EPO; JPO; DERWENT; IBM_TDB | 2004/05/25 09:17
 | 2 | 5897622.pn. | USPAT; US-PGPUB; EPO; JPO; DERWENT; IBM_TDB | 2004/05/25 08:38
 | 2 | 5819265.pn. | USPAT; US-PGPUB; EPO; JPO; DERWENT; IBM_TDB | 2004/05/10 13:26
 | 0 | 584103.apn. | USPAT; US-PGPUB; EPO; JPO; DERWENT; IBM_TDB | 2004/05/10 13:58
 | 125 | document with proper with name | USPAT; US-PGPUB; EPO; JPO; DERWENT; IBM_TDB | 2004/05/10 13:58
 | 34 | (document with proper with name) and categor$ and rank$3 | USPAT; US-PGPUB; EPO; JPO; DERWENT; IBM_TDB | 2004/05/10 13:59
 | 10 | @pd<20000531 and ((document with proper with name) and categor$ and rank$3) | USPAT; US-PGPUB; EPO; JPO; DERWENT; IBM_TDB | 2004/05/10 14:00
 | 4 | (@pd<20000531 and ((document with proper with name) and categor$ and rank$3)) and cluster | USPAT; US-PGPUB; EPO; JPO; DERWENT; IBM_TDB | 2004/05/10 14:00

L Number | Hits | Search Text | DB | Time stamp
1 | 6004 | LIST WITH QUER$3 | USPAT; US-PGPUB; EPO; IBM_TDB | 2004/05/25 10:09
2 | 10343 | SET WITH QUER$3 | USPAT; US-PGPUB; EPO; IBM_TDB | 2004/05/25 10:09
3 | 13688 | (LIST WITH QUER$3) OR (SET WITH QUER$3) | USPAT; US-PGPUB; EPO; IBM_TDB | 2004/05/25 10:10
4 | 2187 | PAIR WITH QUER$3 | USPAT; US-PGPUB; EPO; IBM_TDB | 2004/05/25 10:10
5 | 719 | ((LIST WITH QUER$3) OR (SET WITH QUER$3)) AND (PAIR WITH QUER$3) | USPAT; US-PGPUB; EPO; IBM_TDB | 2004/05/25 10:10
6 | 35204 | PAIR WITH FREQUENCY | USPAT; US-PGPUB; EPO; IBM_TDB | 2004/05/25 10:10
7 | 50 | (((LIST WITH QUER$3) OR (SET WITH QUER$3)) AND (PAIR WITH QUER$3)) AND (PAIR WITH FREQUENCY) | USPAT; US-PGPUB; EPO; IBM_TDB | 2004/05/25 10:10
8 | 9 | ((((LIST WITH QUER$3) OR (SET WITH QUER$3)) AND (PAIR WITH QUER$3)) AND (PAIR WITH FREQUENCY)) AND SYNONYM | USPAT; US-PGPUB; EPO; IBM_TDB | 2004/05/25 10:11
9 | 6 | (((((LIST WITH QUER$3) OR (SET WITH QUER$3)) AND (PAIR WITH QUER$3)) AND (PAIR WITH FREQUENCY)) AND SYNONYM) AND @AD<20020201 | USPAT; US-PGPUB; EPO; IBM_TDB | 2004/05/25 10:11

ABSTRACT



A document retrieval system where a user can enter a query, including a natural language query, in a desired one of a plurality of supported languages, and retrieve documents from a database that includes documents in at least one other language of the plurality of supported languages. The user need not have any knowledge of the other languages. Each document in the database is subjected to a set of processing steps to generate a language-independent conceptual representation of the subject content of the document. This is normally done before the query is entered. The query is also subjected to a (possibly different) set of processing steps to generate a language-independent conceptual representation of the subject content of the query. The documents and queries can also be subjected to additional analysis to provide additional term-based representations, such as the extraction of information-rich terms and phrases (such as proper nouns). Documents are matched to queries based on the conceptual-level contents of the document and query, and, optionally, on the basis of the term-based representation. The query's representation is then compared to each document's representation to generate a measure of relevance of the document to the query.

35 Claims, 8 Drawing Sheets

[Drawing sheet: flowchart of concept disambiguation. Box text, as far as it is legible:]
For each word in a sentence, identify the word which is assigned with one Concept
Count the number of occurrences of each [...]
Determine the Frequent Concept as the disambiguated sense of the word
Compute the combined and normalized correlation between each concept assigned to the word with 1) the Unique Concepts and 2) Frequent Concepts in the sentence
Determine which Concept(s) has the highest correlation
Are there more than one Concept with highest [...]
[...] bigger than predetermined [...]
Choose one or both alternatives
Determine the Concept with the highest correlation as the [...] word
Compute the combined and [...] each Concept assigned to [...] sense of the word
Determine the Concept with [...]

BACKGROUND OF THE INVENTION

The present invention relates generally to computerized information retrieval, and more specifically to multilingual document retrieval.

A global information economy requires an information utility capable of searching across multiple languages simultaneously and seamlessly. However, when a scientist, patent attorney or patent examiner, student, or any information seeker conducts an electronic search for documents, that search is usually limited to texts in the searcher's native tongue, even though highly relevant information may be freely available in a foreign language. Searching for information across multiple languages invariably proves daunting and expensive, or fruitless and inefficient, and is therefore rarely done.

Patent searching is but one example where limitations of language pose significant obstacles. In prior art terms, all languages are created equal. As a practical matter, a patent examiner in a given country tends to have the most meaningful access to documents in that country's language. Since the most pertinent prior art may be in a different language, patent examiners are often prevented from carrying out an effective examination of patent applications.

The conventional approach to multilingual retrieval is to translate all texts into one common language, then perform monolingual indexing and retrieval. Such systems have several disadvantages. First, the machine translation process, although fully-automated, is often time-consuming and expensive. It is also highly inefficient, since all documents must be translated even though only a small fraction of documents will be relevant to any given query.

Second, the process of translation inevitably introduces errors and ambiguities into the translated document, making subsequent indexing and retrieval troublesome. For example, translation systems perform poorly with specialized discourse (medicine, law, etc.), and are often unable to disambiguate polysemous words (those words with multiple meanings) correctly.

SUMMARY OF THE INVENTION 

The present invention provides document retrieval techniques that enable a user to enter a query, including a natural language query, in a desired one of a plurality of supported languages, and retrieve documents from a database that includes documents in at least one other language of the plurality of supported languages. The user need not have any knowledge of the other languages. The present invention thus makes simultaneously searching multiple languages viable and affordable. Even if the documents of interest are all in one language, the invention gives a user whose native language is different the ability to enter queries in the user's native language.

In short, each document in the database is subjected to a set of processing steps to generate a language-independent conceptual representation of the subject content of the document. This is normally done before the query is entered. The query is also subjected to a (possibly different) set of processing steps to generate a language-independent conceptual representation of the subject content of the query. The documents and queries can also be subjected to additional analysis to provide additional term-based representations, such as the extraction of information-rich terms and phrases (such as proper nouns).
Documents are matched to queries based on the conceptual-level contents of the document and query, and, optionally, on the basis of the term-based representation. For example, the matching can be based in part on the co-occurrence of information-rich terms and phrases, or appropriate expansions or synonyms.

The query's representation is then compared to each document's representation to generate a measure of relevance of the document to the query. Results can be browsed using a graphical interface, and individual documents (or document clusters) that seem highly relevant can be used to inform subsequent queries for relevance feedback. The system may also perform a surface-level, gloss transliteration of the foreign text, sufficient enough for a non-fluent reader to gain a basic understanding of the document's contents.

In specific embodiments, the language-independent conceptual representation of the subject content of the document, and that of the query, is a fixed-length vector based on a set of subject content categories and subcategories. A current implementation supports English, French, German, Spanish, Dutch, and Italian. However, the system is modular, and as additional languages are added to the document databases, those languages become searchable.

The invention, by abstracting the documents and queries into language-independent conceptual form, avoids the need for machine translation of the query or the database of documents. Only those documents which appear highly relevant to the searcher need be considered as candidates for translation (human or machine).

A further understanding of the nature and advantages of the present invention may be realized by reference to the remaining portions of the specification and the drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a multilingual information retrieval system embodying the present invention;

FIG. 2 is a block diagram of the text processing portion of the system;

FIGS. 3A and 3B, taken together, provide a flowchart [...];

FIG. 4 is a high-level diagram showing the processing of French input text to a monolingual concept vector;

FIG. 5 is a more detailed diagram showing the two stages of disambiguation in the processing of French input text to a monolingual concept vector;

FIG. 6 shows an example of a portion of the processing in a monolingual system; and

FIG. 7 shows a logical tree representation of an exemplary query.

DESCRIPTION OF SPECIFIC EMBODIMENTS

1.0 Introduction

The present invention is embodied in a multilingual document retrieval system, 10, sometimes referred to as CINDOR (Conceptual INterlingua DOcument Retrieval). The CINDOR system is capable of accepting a user's query stated in any one of a plurality of supported languages while seamlessly searching, retrieving and relevance-ranking documents written in any of the supported languages. The system further offers a "gloss" transliteration of target documents, once retrieved, sufficient for a surface understanding of the document's contents.

Unless otherwise stated, the term "document" should be taken to mean text, a unit of which is selected for analysis, and to include an entire document, or any portion thereof, such as a title, an abstract, or one or more clauses, sentences, or paragraphs. A document will typically be a member of a document database, referred to as a corpus, containing a large number of documents. Such a corpus can contain documents in any or all of the plurality of supported languages.

Unless otherwise stated, the term "query" should be taken to mean [...] a subset of documents from a document database. While most queries entered by a user tend to be short compared to most documents stored in the database, this should not be assumed. The present invention is designed to allow natural language queries.

Unless otherwise stated, the term "word" should be taken to include single words, compound words, phrases, and other multi-word constructs. Furthermore, the terms "word" and "term" are often used interchangeably. Terms and words include, for example, nouns, proper nouns, complex nominals, noun phrases, verbs, adverbs, numeric expressions and adjectives. This includes stemmed and non-stemmed forms.

The disclosures of all articles and references, including patent documents, mentioned in this application are incorporated herein by reference as if set out in full.

2.0 System Hardware Overview

FIG. 1 is a simplified block diagram of a computer system 10 embodying the multilingual text retrieval system of the present invention. The invention is typically implemented in a client-server configuration including a server 20 and numerous clients, one of which is shown at 25. The use of the term "server" is used in the context of the invention, where the server receives queries from (typically remote) clients, does substantially all the processing necessary to formulate responses to the queries, and provides these responses to the clients. However, server 20 may itself act in the capacity of a client when it accesses remote databases located on a database server. Furthermore, while a client-server configuration is shown, the invention may be implemented as a standalone facility, in which case client 25 would be absent from the figure.

The hardware configurations are in general standard, and will be described only briefly. In accordance with known practice, server 20 includes one or more processors 30 that communicate with a number of peripheral devices via a bus subsystem 32. These peripheral devices typically include a storage subsystem 35 (memory subsystem and file storage subsystem), a set of user interface input and output devices 37, and an interface to outside networks, including the public switched telephone network. This interface is shown schematically as a "Modems and Network Interface" block 40, and is coupled to corresponding interface devices in client computers via a network connection 45.

Client 25 has the same general configuration, although typically with less storage and processing capability. Thus, while the client computer could be a terminal or a low-end personal computer, the server computer would generally need to be a high-end workstation or mainframe. Corresponding elements and subsystems in the client computer are shown with corresponding, but primed, reference numerals.

The user interface input devices typically includes a keyboard and may further include a pointing device and a scanner. The pointing device may be an indirect pointing device such as a mouse, trackball, touchpad, or graphics tablet, or a direct pointing device such as a touchscreen incorporated into the display. Other types of user interface input devices, such as voice recognition systems, are also possible.

The user interface output devices typically include a printer and a display subsystem, which includes a display controller and a display device coupled to the controller. The display device may be a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), or a projection device. Display controller provides control signals to the display device and normally includes a display memory for storing the pixels that appear on the display device. The display subsystem may also provide non-visual display such as audio output.

The memory subsystem typically includes a number of memories including a main random access memory (RAM) for storage of instructions and data during program execution and a read only memory (ROM) in which fixed instructions are stored. In the case of Macintosh-compatible personal computers the ROM would include portions of the operating system; in the case of IBM-compatible personal computers, this would include the BIOS (basic input/output system).

The file storage subsystem provides persistent (non-volatile) storage for program and data files, and typically includes at least one hard disk drive and at least one floppy disk drive (with associated removable media). There may also be other devices such as a CD-ROM drive and optical drives (all with their associated removable media). Additionally, the system may include drives of the type with removable media cartridges. The removable media cartridges may, for example, be hard disk cartridges, such as those marketed by Syquest and others, and flexible disk cartridges, such as those marketed by Iomega. As noted above, one or more of the drives may be located at a remote location, such as in a server on a local area network or at a site on the Internet's World Wide Web.

In this context, the term "bus subsystem" is used generically so as to include any mechanism for letting the various components and subsystems communicate with each other as intended. With the exception of the input devices and the display, the other components need not be at the same physical location. Thus, for example, portions of the file storage system could be connected via various local-area or wide-area network media, including telephone lines. Similarly, the input devices and display need not be at the same location as the processor, although it is anticipated that the present invention will most often be implemented in the context of PCs and workstations.

Bus subsystem 32 is shown schematically as a single bus, but a typical system has a number of buses such as a local bus and one or more expansion buses (e.g., ADB, SCSI, ISA, EISA, MCA, NuBus, or PCI), as well as serial and parallel ports. Network connections are usually established through a device such as a network adapter on one of these expansion buses or a modem on a serial port. The client computer may be a desktop system or a portable system.

The user interacts with the system using user interface devices 37' (or devices 37 in a standalone system). For example, client queries are typically entered via a keyboard, communicated to client processor 30', and thence to modem or network interface 40' over bus subsystem 32'. The query is then communicated to server 20 via network connection 45. Similarly, results of the query are communicated from the server to the client via network connection 45 for output on one of devices 37' (say a display or a printer), or may be stored on storage subsystem 35'.

3.0 Text Processing (Software) Overview

3.1 Basic Functionality

The server's storage subsystem 35 shows the basic programming and data constructs that provide the functionality of the CINDOR system. The CINDOR software is designed to (1) process text stored in digital form or entered in digital form on a computer terminal to create a database file recording the manifold contents of the text, and (2) match discrete texts (documents) to the requirements of a user's query text. CINDOR provides rich, deep processing of text by representing and matching documents and queries at the lexical, syntactic, semantic and discourse levels, not simply by detecting the co-occurrence of words or phrases. A user of the system is able to enter queries, in the user's own language, as fully-formed sentences, with no requirement for special coding, annotation or the use of logical operators.

The system is modular and performs staged processing of documents, with each module adding a meaningful annotation to the text. For matching, a query undergoes analogous processing to determine the requirements for document matching. The system generates both conceptual and term-based alternative representations of the documents and queries.

The server's storage subsystem 35, as shown in FIG. 1, contains the basic programming and data constructs that provide the functionality of the CINDOR system. The processing modules include a set of processing engines, shown collectively in a processing engine block 50, and a query-document matcher 55. It should be understood, however, that by the time a user is entering queries into the system, the relevant document databases will have been processed and annotated, and various data files and data constructs will have been established. These are shown schematically as a "Document Database and Associated Data" block 60, referred to collectively below as the document database. An additional set of resources 65, possibly including some derived from the corpus at large, is used by the processing engines in connection with processing the documents and queries. As will be described below, resources 65 include a number of multilingual resources.

User interface software 70 allows the user to interact with the system. The user interface software is responsible for accepting queries, which it provides to processing engines 50. The user interface software also provides feedback to the user regarding the query, and, in specific embodiments, accepts responsive feedback from the user in order to reformulate the query. The user interface software also presents the results of the query to the user and reformats the output in response to user input. User interface software 70 is preferably implemented as a graphical user interface (GUI), and will often be referred to as the GUI.

Processing of documents and queries follows a "modular" progression, with documents being matched to queries based on matching (1) their conceptual-level contents, and (2) various term-based and logic representations such as the frequency/co-occurrence of proper nouns. At the conceptual level of matching, each substantive word in a document or query is assigned a concept category, and these category frequencies are summed to produce a vector representation of the whole text. Proper nouns are considered separately and, using a modified, fuzzy Boolean representation, matching occurs based on the frequency and co-occurrence of proper nouns in documents and queries. The principles applied to the proper noun matching are applicable to matching for other terms and parts of speech, such as complex nominals (CNs) and single terms.

While FIG. 1 shows documents and queries being processed, it should be understood that the documents would normally have been processed during an initial phase of setting up the document database and related structures, with relevant information extracted from the documents and
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indexed as part of the database. Accordingly, in the discus- a score for each document. The output scores are processed 

sion that follows, when reference is made to documents and by recall predictor 240 so as to select a proper set for output, 

queries being processed in a particular way, it is generally to TTie results are stored, and typically presented to a user for 

be understood that the processing of documents and queries browsing at GUI 250. 

would be occurring at different times. 5 The processing modules can be grouped at a higher level. 

3.2 Processing Module Overview Preprocessor 110, LI 120, POS tagger 130, and PNC 140 

FIG. 2 is a block diagram showing the set of modules that perform initial processing for tagging and identification; 

form processing engines 50, query-document matcher 55, MCGRE 150, MCGD 160, MCG-MHCM 170, MHCD 180, 

and user interface software 70. Documents and queries are and MCVG 190 generate conceptual-level representations 

processed by this set of modules that provide a language- 10 of the documents and queries; PTI 210 and PQP 220 

independent conceptual representation of each document generate term-based representations of the documents and 

and query. (As mentioned above, the documents and queries queries; MCVM 200 and QDM and score combiner 230 

are also subjected to separate processing.) In this context, correlate the document and query information to provide an 

the modifier "language-independent" means that the docu- evaluation of the documents; and recall predictor 240 and 

ments and queries are all abstracted to a set of categories 15 GUI 250 are concerned with presenting the results to the 

expressed in a common representation without regard to user. 

their original language. This processing is distinct from A number of the processing modules mentioned above 

machine translation, as will be seen below. This does not rely on associated resources, including databases and the 

mean, however, that retrieved documents could not then be like. While these resources will be described in connection 

translated, by machine or otherwise, if deemed appropriate 20 with the following detailed descriptions of the modules, they 

by the user. are enumerated here for clarity. 

The set of modules that perform the processing to gen- PNC 140: 

erate the conceptual representation and the term-based rep- proper noun knowledge databases (PKND). 

resentation includes:

a preprocessor 110,

a language identifier (LI) 120,

a part of speech (POS) tagger 130,

a proper noun categorizer (PNC) 140,

a multilingual concept group retrieval engine (MCGRE) 150,

a multilingual concept group disambiguator (MCGD) 160,

a multilingual concept group to monolingual hierarchical
concept mapper (MCG-MHCM) 170,

a monolingual hierarchical concept category disambiguator
(MHCD) 180,

a monolingual category vector generator (MCVG) 190,

a monolingual category vector matcher (MCVM) 200,

a probabilistic term indexer (PTI) 210,

a probabilistic query processor (PQP) 220,

a query to document matcher (QDM) and score combiner 230,

a recall predictor 240, and

a graphical user interface (GUI) 250.

The output of MCVG 190 is a monolingual category vector
(also referred to as the semantic vector, or simply the
vector) for each document and query, and represents the
documents or query at a language-independent conceptual
level. The query's monolingual category vector is matched
or compared with monolingual category vectors of the
documents by MCVM 200. The output from MCVM 200
provides a measure of relevance (score) for each document
with respect to the query.

While this information alone could be used to rank
documents, it is preferred to subject the documents and the
queries to an additional set of operations to provide
additional bases for evaluating relevance. To this end, the
document information output from PNC 140 is communicated
to PTI 210, while the query information from MCGD 160 is
communicated to PQP 220. PTI 210 and PQP 220 provide
term-based representations of the documents and query,
respectively.

The outputs from MCVM 200, PTI 210, and PQP 220 are
evaluated by QDM and score combiner 230, which provides

MCGRE 150:

multilingual concept database (MCD).

MCGD 160:

multilingual concept group n-gram probability database,

multilingual concept group correlation matrix (MCGCM), and

frequency database.

MCG-MHCM 170:

monolingual hierarchical concept dictionary (MHCD).

MHCD 180:

monolingual category correlation matrix (MCCM).

index

What follows is a module by module description of the
system.

4.0 Initial Processing and Tagging

4.1 Preprocessor 110

Preprocessor 110 accepts raw, unformatted text and transfers
this to a standard format suitable for further processing
by CINDOR. The preprocessor performs document-level
processing as follows:

The beginning and end of documents are identified and
marked.

Discourse-level tagging occurs. Various field types are
identified and tagged in a document, including "headline,"
"sub-text headline," "date," and "caption."

All text is annotated with SGML-like tags (standard
generalized markup language, set forth as ISO standard
8879).

4.2 Language Identifier (LI) 120

LI 120 determines by means of a combination of n-gram
and word frequency analysis the language of the input
document. The output of the LI is the document plus its
language identification tag.

Two parallel approaches for language identification are
employed. The first approach operates by scanning documents
for a distribution of language-discriminant, common single
words. The occurrence, frequency and distribution of these
words in a document is compared against the same
distributions gathered from a representative corpus of
documents in each of the supported languages. The second
approach involves locating common word/character sequences
unique to each language. Such sequences may form actual
words that often occur, such as conjunctions, or a mix of
words, punctuation and character strings. Language
identification involves scanning each document until a target
character sequence is located.

It should be realized that the LI is not necessary if the 
documents are already tagged as to their language. 
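
The first approach described above (counting language-discriminant common words) can be pictured with a minimal sketch. The word lists and the function name `identify_language` are hypothetical and purely illustrative; the system described here compares full distributions gathered from a representative corpus, whereas this sketch only counts hits against tiny hand-picked lists.

```python
import re
from collections import Counter

# Hypothetical, very small lists of language-discriminant common words.
# A real profile would be gathered from a representative corpus.
COMMON_WORDS = {
    "en": {"the", "and", "of", "is", "to", "in"},
    "fr": {"le", "la", "les", "et", "des", "est"},
    "de": {"der", "die", "das", "und", "ist", "nicht"},
}


def identify_language(text):
    """Return the language whose common words occur most often, or None."""
    # Split into lowercase alphabetic tokens (digits and punctuation dropped).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(tokens)
    # Score each language by the number of tokens found in its word list.
    scores = {
        lang: sum(counts[word] for word in words)
        for lang, words in COMMON_WORDS.items()
    }
    best = max(scores, key=scores.get)
    # With no hits at all there is no evidence for any language.
    return best if scores[best] > 0 else None
```

A document tagged this way would carry the returned code as its language identification tag; ties are broken arbitrarily in this sketch.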

4.3 Part of Speech (POS) Tagger 130

The language dependent, probabilistic, POS tagger 130 
determines the appropriate part of speech for each input 
word in the document and outputs a part of speech tagged 
document, plus its language identification tag. 

POS tagger 130 is used to identify various substantive 
words such as nouns, verbs, adjectives, proper nouns, and 
adverbs in each of the supported languages. Various func- 
tional words such as conjuncts are tagged as stop-words and 
are not used for matching purposes. Each language-specific 
POS tagger is a commercial off-the-shelf (COTS) technology.

4.4 Proper Noun Identifier & Categorizer (PNC) 140

In addition to the parts of speech processing in POS tagger 
130, additional processing of proper nouns occurs in a 
separate processing module, namely PNC 140, which per- 
forms the following tasks: 

Identifies and tags adjacent proper nouns in a text using 
the Proper Noun Boundary Identifier (PNBI). The PNBI 
uses various heuristics developed through multilingual cor- 
pus analysis to bracket adjacent proper nouns (e.g., IBM 
Corporation) and bracket proper nouns with embedded 
conjunctions and prepositions (e.g., the Bill of Rights). For 
example, one heuristic takes the form of a database of proper 
nouns such as University or Mayor that are frequently linked 
to proximate proper nouns by the preposition "of." In
another scheme, specific instantiations of adjacent proper 
nouns can be stored in a database. Each supported language 
has an independent array of tools and embedded databases 
for detecting and tagging adjacent proper nouns. 

Normalizes each proper noun to its standard form. For 
example, "IBM" and the colloquial "Big Blue" are both 
normalized to the standard form of "International Business 
Machines, Inc." in the knowledge database. 

Expands group proper nouns to their constituent members 
using the proper noun knowledge databases. For example, 
the group proper noun "European Community" is expanded 
to all member countries (Great Britain, France, Germany, 
etc.). Later matching would consider all expansions on the 
original proper noun group. 

Assigns monolingual concept-level categories from a
proper noun hierarchical classification scheme to certain
proper nouns or portions of proper nouns. The proper noun
classification scheme is based on algorithmic machine-aided
corpus analysis in each supported language. In a specific 
implementation, the classification is hierarchical, consisting 
of nine branch nodes and thirty terminal nodes. Clearly, this 
particular hierarchical arrangement of codes is but one of 
many arrangements that would be suitable. Table 1 shows a 
representative set of proper noun concept categories and 
subcategories. 

TABLE 1

Proper Noun Categories and Subcategories

Geographic Entity: City, Port, Airport, Island, County, Province,
    Country, Continent, Region, Water, Geographic Miscellaneous
Affiliation: Religion, Nationality
Organization: Company, Company Type, Government, U.S. Government,
    Organization
Human: Person, Title
Document: Document
Equipment: Software, Hardware, Machines
Scientific: Disease, Drugs, Chemicals
Temporal: Date, Time
Miscellaneous: Miscellaneous



Classification is accomplished by reference to an array of 
knowledge bases and context heuristics, which collectively 
define the proper noun knowledge database (PNKD). The 
PNKD was built by analyzing a large corpus of texts, and
contains the following different types of information which 
are used to categorize and standardize proper nouns in texts: 

(1) lists of common prefixes and suffixes which suggest 
certain types of proper noun categories; 

(2) lists of contextual linguistic clues which suggest 
certain types of proper noun categories; 

(3) lists of commonly used alternative names of the highly 
frequent proper nouns; and 

(4) lists of highly common proper nouns and the catego- 
ries to which the proper nouns belong. 
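
The normalization, expansion, and categorization tasks that draw on these lists can be sketched as simple table lookups. Everything below is a hypothetical miniature of a proper noun knowledge database: the dictionary names, their entries, and the fallback to "Miscellaneous" are assumptions for illustration, with category labels borrowed from Table 1.

```python
# Hypothetical miniature knowledge base; real lists would come from corpus analysis.
ALIASES = {  # list (3): alternative names, mapped to a standard form
    "ibm": "International Business Machines, Inc.",
    "big blue": "International Business Machines, Inc.",
}
GROUP_MEMBERS = {  # group proper nouns and their constituent members
    "European Community": ["Great Britain", "France", "Germany"],
}
KNOWN_CATEGORIES = {  # list (4): common proper nouns with their categories
    "France": "Country",
    "Germany": "Country",
}
SUFFIX_CLUES = {  # list (1): suffixes that suggest a category
    "corporation": "Company",
    "inc.": "Company",
    "airport": "Airport",
    "island": "Island",
}


def normalize(name):
    """Map a proper noun to its standard form when an alias is known."""
    return ALIASES.get(name.lower(), name)


def expand(name):
    """Return the proper noun followed by its group members, if any."""
    return [name] + GROUP_MEMBERS.get(name, [])


def categorize(name):
    """Assign a category from the known list, then from suffix clues."""
    if name in KNOWN_CATEGORIES:
        return KNOWN_CATEGORIES[name]
    words = name.split()
    if not words:
        return "Miscellaneous"
    # Assumed fallback: anything unrecognized goes to "Miscellaneous".
    return SUFFIX_CLUES.get(words[-1].lower(), "Miscellaneous")
```

Contextual linguistic clues (list (2)) are left out of this sketch because they depend on the surrounding sentence rather than on the proper noun alone.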

Classification includes (but is not limited to) company
names, organization names, geographic entities, government
units, government and political officials, patented and trade- 
marked products, and social institutions. Monolingual 
proper noun concept categories are used to help form the 
monolingual category vector representation of both the 
document and query (see later descriptions). As noted above, 
the documents and queries output from PNC 140 are
communicated to MCGRE 150, while in the specific
implementation the documents only are communicated to PTI 210.
5.0 Generation of Conceptual Level Representation 

5.1 Multilingual Concept Group Retrieval Engine 
(MCGRE) 150 

Modules 150 through 190 (i.e., MCGRE 150, MCGD 
160, MCG-MHCM 170, MHCD 180, and MCVG 190) are 
used to generate monolingual category vector codes of the 
subject-contents of both documents and queries. This pro- 
cess involves recognizing various information-rich words or 
parts of speech in a native language text, assigning a single 
code to these words or phrases that establishes its 
conceptual-level meaning, then mapping this conceptual-
level representation to an English language, hierarchical 
system of concept codes for vector creation. 

The first of these modules, MCGRE 150, accepts the 
language-identified, part-of-speech tagged, input text and 
retrieves from the multilingual concept database any and all 
of the concept groups to which each input word belongs. 
Polysemous words (those words with multiple meanings)
will have multiple concept group assignments at this stage.
The output of the MCGRE 150, when run over a document,
will be sentence-delimited strings of words, each word or
phrase of which has been tagged with the codes of all the
multilingual concept groups to which various senses of the
word/phrase belongs.

This process incorporates:

(a) Deinflection of words (finding their root form);

(b) Locating clitics (articles or pronouns attached to
words or punctuation, as with the French "l'enfant");

(c) Identifying and splitting compound words (words
consisting of two or more linked words); and

(d) Mapping each word to all possible corresponding
concept categories using the multilingual concept
database (MCD).

The MCD is a language-independent knowledge database
comprising a collection of concept groups. There are about
10,000 concept groups in a current implementation. Within
each concept group is a collection of words or phrases, in
multiple languages, that are conceptually synonymous or
near-synonymous. Usually all members of a given concept
group belong to the same part of speech. It is possible that
many words in a given language will occur in a given concept
group, or that a given word or phrase will occur in multiple
concept groups. The number of concept groups that a given
word or phrase occupies is dependent on the degree of
polysemy of that word or phrase. For example, a word that
has three possible senses may occupy three different concept
groups. Each group is considered a language-independent
concept. Note that the MCD differs from a thesaurus because
the concept groups are not linked by broader or narrower
relations. The MCD differs from a dictionary translation
because the MCD grouping is by synonymous words, not by
translation definition.

5.2 Multilingual Concept Group Disambiguator (MCGD) 160

The input to MCGD 160 is the fully-tagged text stream
from MCGRE 150 with polysemous words having multiple
concept-category tags. The function of the MCGD is to
select the single most appropriate concept group from the
multilingual concept database for all those input words for
which multiple concept group tags have been retrieved. The
output of the MCGD is a fully-tagged text stream with a
single multilingual concept group for each word in the input
text. The processing performed by this module is similar to
that discussed in copending commonly-owned U.S. patent
application Ser. No. 08/135,815, filed Oct. 12, 1993, entitled
"Natural Language Processing System For Semantic Vector
Representation Which Accounts For Lexical Ambiguity," to
Elizabeth D. Liddy, Woojin Paik, and Edmund Szu-li Yu,
though modified for a multilingual system. The application
mentioned immediately above, hereinafter referred to as
"Natural Language Processing," is hereby incorporated by
reference for all purposes.

FIGS. 3A and 3B, taken together, provide a flowchart
showing the operation of MCGD 160. MCGD 160 processes
text a sentence at a time, using the original language of the
input text as a useful context for selecting the most
appropriate sense of the words in a sentence.

If disambiguation is needed (the input word belongs to
more than one concept group), then the MCGD will select
the appropriate concept group using three sources of
linguistic evidence. These are: (a) Local Context, (b) Domain
Knowledge, and (c) Global Information, which are used as
follows.

5.2.1 Local Context

If a word in the sentence has been tagged with only one
concept group code, this concept group code is considered
Unique. Further, if there are any concept group codes which
have been assigned to more than a predetermined number of
words within the sentence being processed, these concept
group codes are considered Frequent codes. These two types
of locally determined concept group codes are used as
"anchors" in the sentence for disambiguating the remaining
words. If any of the ambiguous (polysemous) words in the
sentence have either a Unique or Frequent concept group
code amongst their codes, that concept group code is
selected and that word is thereby disambiguated.

FIG. 3A shows this process where MCGD 160 determines
whether a given multilingual concept group code is Unique
or Frequent, and further whether a given ambiguous word
has a Unique or Frequent code as one of its assigned codes.
To the extent that the word is associated with a Unique or
Frequent code, that Unique or Frequent code is selected.
A word that has no overlap between its concept group codes
and the Unique or Frequent concept group codes for that
sentence cannot be disambiguated using local context
evidence and must be evaluated by the next source of
linguistic evidence, Domain Knowledge.

5.2.2 Domain Knowledge

Domain Knowledge representations reflect the extent to
which words of one concept group tend to co-occur with
words of the other concept groups (hence the notion of the
domain predicting the sense). For each word which has not
had one of its multiple concept group codes selected using
local information, the system consults the multilingual
concept group correlation matrix (MCGCM) to select an
appropriate concept group code from the multiple concept
group codes attached to the input word.

The MCGCM is an optional knowledge database that
reflects observed document level co-occurrence patterns
across a large corpus of single language documents. This
correlation matrix is built from the training data to be used
as an additional knowledge source to disambiguate multiple
concept groups which are assigned to the terms in both query
and documents. The training data which is used to construct
the correlation matrix is either all possible concept groups
assigned to each term in the texts, or the partially
disambiguated concept groups in the texts. Thus, the
construction of the correlation matrix does not require
manual intervention.

This correlation matrix is constructed from the correlation
information among all concept groups assigned to terms in
one document. The collection of the correlation information
is summed and normalized to get the stable correlation
among all possible concept groups (i.e., each concept group
will have a correlation value against all the other possible
concept groups).

The MCGCM consists of unweighted Pearson's product-
moment correlation coefficients for all of the multilingual
database concept group pairs using within-document
occurrences as the unit of analysis. The result will be
correlation scores for each concept group pair between -1
and +1. Within a sentence a word with multiple concept
categories is disambiguated to the single concept category
that is most highly correlated with the Unique or Frequent
concept category. If several Unique or Frequent anchor
words exist, the ambiguous word is disambiguated to the
correct category of the anchor word with the highest overall
correlation coefficient.

The Local and Domain Knowledge evidence sources can
select a concept group code for each word in the sentence,
if at least a single Unique or Frequent concept group code
was selected as an "anchor" code for the sentence. But, for
words in those sentences for which an "anchor" was not




found, the third evidence source, Global Knowledge, will
need to be consulted.

5.2.3 Global Knowledge

Global Knowledge simulates the observation made in
human sense disambiguation that more frequently used
senses of words are cognitively activated in preference to
less frequently used senses of words. Therefore, the words
not yet disambiguated by Local Context or Domain Knowledge
will now have their multiple concept group codes
compared to a Global Knowledge database source, referred
to as the frequency database. The database is an external,
off-line sense-tagging of parallel corpora with the correct
concept group code for each word. The disambiguated
parallel corpora will provide frequencies of each word's
usage as a particular sense (equatable to concept group) in
the sample corpora. The most frequent sense is selected as
the concept category.

The frequency database can be constructed in any of the
following three ways:

(1) Collect the most frequent sense information from
partially or fully sense-disambiguated texts (the training data
to collect sense frequency information can be built either
manually or automatically). Training data can be built
automatically from the output from MCGD module without the
frequency database OR the output from automatic sense
comparison using multilingual aligned corpus such as
"Canadian Hansard."

(2) Have a native language expert select the most
common sense of terms.

(3) Use frequency information from a lexicon that
provides its senses with frequency information.

The multilingual concept group n-gram probability
database is an optional knowledge database that is constructed
from a training data set. The database contents are derived
from a text corpus analysis of words used in various
supported languages in various contexts. The data in the
database can be either (1) sense-correct concept groups assigned
to each term in the texts, or (2) all possible concept groups
assigned to each term in the texts (e.g., if one term belongs
to three concept groups, then three concept groups will be
assigned to that term).

This knowledge database collects all concept groups
which are assigned to N adjacent terms in the texts. The
resulting ordered lists are summed and normalized to
produce the likelihood probability of the Nth term assigned with
certain concept groups which are assigned to the
(N-1)th, . . . (N-(N-1))th terms.

FIG. 3B shows this process where MCGD 160 has had to
resort to Domain Knowledge (using the MCGCM) and
Global Knowledge (using the n-gram probability database)
to disambiguate the polysemous words.

The output of MCGD 160 is a single multilingual concept
group for each substantive word in the input text. This
concept group may comprise either a single word choice or
several word choices, depending on the membership of the
concept group. Words from all supported languages will be
represented.

5.3 Multilingual Concept Group to Monolingual
Hierarchical Concept Mapper (MCG-MHCM) 170

MCG-MHCM 170 takes as input the fully-tagged, native
language text stream with single multilingual concept
categories assigned for each substantive word and maps this flat
conceptual representation to an English language
hierarchical representation. MCG-MHCM 170 performs the
following:

(a) Maps all the native language words in a single concept
category to the English word member/s in that category.

(b) Converts the English word members of the selected
concept group from the multilingual concept database
(MCD) to zero or more categories in the monolingual
hierarchical concept dictionary (MHCD). This is a static
mapping scheme, whereby all the English word members of
a particular concept group are treated as being equally likely
instantiations. In this static implementation, all English
word members of the selected multilingual concept group
are mapped to their respective categories in the MHCD. The
frequencies of the concept categories mapped to by the
English word members of the selected multilingual concept
group are determined, and the most frequent category
for that word is selected. If there are multiple categories in
the MHCD to which the word members of the selected
multilingual concept group map, then these multiple
categories need to be disambiguated in the next component of
the system.

(c) Maps the many thousand multilingual concept
categories to fewer, higher order monolingual categories.

The MHCD is different from the MCD in that the MHCD
consists of terms in one language (in the current system,
English terms make up the database). While the MHCD and
MCD both define concepts as groups of synonyms, the
MHCD can be characterized by the hierarchical organization
which is imposed on the concepts. The hierarchy can be
constructed by relating concepts with relations such as
"super/sub type" and "broader/narrower." In the current
implementation, the MHCD is a COTS product.

The output of the MCG-MHCM module is a tagged,
native language text stream with unique, monolingual
(English), hierarchical concept categories assigned to each
identified substantive word.

5.4 Monolingual Hierarchical Concept Category
Disambiguator (MHCD) 180

MHCD 180 accepts the monolingual categories assigned
to substantive words in a text and performs disambiguation
similar to that performed by the multilingual concept group
disambiguator (MCGD) module. The disambiguation
process is similar to the disambiguation performed by the
Subject Field Code (SFC) disambiguator covered in
"Natural Language Processing."

The MHCD performs the following processing of text
using the following evidence sources:

(a) Local Context: The processing here will be nearly
identical to the use of local information in MCGD 160
described above. That is, Unique or Frequent categories will
be determined for each sentence and then used as "anchors"
to select one monolingual category from amongst the
multiple monolingual categories to which an ambiguous
multilingual concept group has mapped.

(b) Domain Knowledge: The monolingual category
correlation matrix (MCCM) is used to indicate the probabilities
that the multiple monolingual categories to which a
multilingual concept group has been mapped correlate with the
Unique or Frequent monolingual category determined by
local context. The MCCM is produced from a document
corpus, and is similar to the multilingual concept group
correlation matrix (MCGCM) in terms of how the two are
constructed and their internal structures.

(c) Global Knowledge: If there is no Unique or Frequent
monolingual category in an input sentence, then the system
has no "anchor" by which to access the Correlation Matrix
and must use global knowledge. In this event, the frequency
of use of various senses of a word is used as the basis for the
global knowledge source.

The output of the MHCD module is a text stream with
disambiguated monolingual categories assigned to each
substantive word.
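
The three evidence sources used by MCGD 160 (and, in nearly identical form, by MHCD 180) can be read as a cascade. The following is a minimal sketch under stated assumptions, not the patented implementation: the function name, the data shapes (a list of word/candidate-code pairs, a correlation table keyed by unordered code pairs, a sense-frequency table keyed by word and code), the default threshold, and the rule of falling through to Global Knowledge when no positive correlation exists are all hypothetical. Each word is assumed to carry at least one candidate code.

```python
from collections import Counter


def disambiguate_sentence(tagged, correlation, sense_frequency,
                          frequent_threshold=2):
    """Pick one code per word.

    tagged          -- list of (word, [candidate codes]) for one sentence
    correlation     -- {frozenset({code_a, code_b}): coefficient in [-1, 1]}
    sense_frequency -- {(word, code): corpus frequency of that sense}
    """
    # Local Context: Unique codes belong to words with a single candidate;
    # Frequent codes are assigned to more than `frequent_threshold` words.
    unique = {codes[0] for _, codes in tagged if len(codes) == 1}
    counts = Counter(c for _, codes in tagged for c in set(codes))
    frequent = {c for c, n in counts.items() if n > frequent_threshold}
    anchors = unique | frequent

    def strength(code):
        # Domain Knowledge: best correlation between a code and any anchor.
        return max(correlation.get(frozenset((code, a)), 0.0) for a in anchors)

    def most_frequent_sense(word, codes):
        # Global Knowledge: most frequently used sense of the word.
        return max(codes, key=lambda c: sense_frequency.get((word, c), 0))

    result = []
    for word, codes in tagged:
        local = [c for c in codes if c in anchors]
        if len(codes) == 1:
            choice = codes[0]
        elif local:
            choice = local[0]  # assumption: first anchor code wins
        elif anchors and strength(max(codes, key=strength)) > 0:
            choice = max(codes, key=strength)
        else:
            choice = most_frequent_sense(word, codes)
        result.append((word, choice))
    return result
```

With a sentence containing only "agricole" (one code) and "regime" (three codes) and no correlation evidence, the sketch falls through to the frequency table for "regime", matching the behavior the text attributes to Global Knowledge.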
5.5 Monolingual Hierarchical Concept Dictionary-Based
Vector Generator (MCVG) 190

MCVG 190 accepts a text stream with single monolingual
category assigned to each substantive word in a text, and
produces a fixed-dimension vector representation of the
concept-level contents of the text. The basic processing
performed by this module is the same as that performed by
the Subject Field Code (SFC) vector generator described in
"Natural Language Processing."

The MCVG generates a representation of the meaning
(context) of the text of a document/query in the form of
monolingual category (subject) codes assigned to
information bearing words in the text. The monolingual category
vector for all documents and queries has the same number of
dimensions; weights or scores are applied to each dimension
according to the presence and frequency of text with certain
subject-contents.

The MCVG creates a vector code index file for each
document to facilitate efficient searching and matching.
Typically, the relative importance of the concept in each
document and the link between the term and the document
in which the term occurred is preserved. The vector code
index file for each document is a fixed length file containing
scores/weights for each dimension (called a slot) of the
vector.

MCVG 190 performs the following staged processing:

(a) The frequencies of the disambiguated monolingual
category codes assigned to words in the text are summed and
then normalized in order to control for the effect of
document length.

(b) The resulting normalized document vectors are fixed-
dimension vectors representing the concept-level contents of
the processed text (either documents or queries). They are
passed to the next module for either document-to-query-
vector matching (comparison), or for document-to-
document matching (comparison) for clustering of
documents.

5.6 Concept Mapper and Disambiguator Operation

FIGS. 4 and 5 are diagrams showing concrete examples of
the processing of French input text to a monolingual concept
vector.

FIG. 4 shows the mapping of two substantive French
words, "agricole" and "regime." The word "agricole" can be
seen to map to a single multilingual concept group with the
English language member "agricultural." As can be seen,
this multilingual concept group maps to the monolingual
category "Agriculture," and contributes to the monolingual
category vector, a portion of which is shown schematically
at the right side of the figure.

The French word "regime," on the other hand, is
polysemous, and maps to three multilingual concept groups
(e.g., concept groups with the English language members
"reign," "system," and "diet"). The word needs to be
disambiguated using the methodology described in the above
discussion of MCGD 160, MCG-MHCM 170 and MHCD
modules, such that an unambiguous, single concept code is
assigned to the word. In this simple example, since no Local
Context or Domain Knowledge can be applied to the
disambiguation process by the word "agricole," (and, for the
purposes of this example, we assume no other words help in
this disambiguation process), Global Knowledge will be
applied and the most common sense of the word will be
invoked ("system").

FIG. 5 shows a complete single French sentence as input,
and shows the two-stage disambiguation explicitly. The
native language sentence is shown being processed through
the multilingual concept group generation process, to a
monolingual conceptual representation with disambiguated
concept codes. For simplicity, only the English language
members of the multilingual concept groups are shown. In
this example, the complete sentence has "anchor codes"
(e.g., "comptant," which maps to code #105, with the
English member "in cash") that can be used to help
disambiguate other polysemous words in the sentence using Local
or Domain Processing. For example, the French "les
paiements" maps to multiple codes, which are disambiguated at
the MCGD to a Finance code.

By way of background, FIG. 6 shows an example of a
portion of the processing in a monolingual system such as
described in "Natural Language Processing." In particular,
FIG. 6 shows the SFC system for monolingual vector
representation of the conceptual contents of a document.

6.0 Generation of Term-Based Representations

6.1 Probabilistic Term Indexer (PTI) 210

PTI 210 accepts the output from PNC 140 (documents
only) and creates a new appended field in the document
file. PTI also assigns a weighted, TF.IDF score
(the product of Term Frequency and Inverse Document
Frequency) for each proper noun. This could be applied to
other types of terms. This weighted score is used in QDM
and score combiner 230. This index file contains all proper
nouns and their TF.IDF scores.

PTI 210 assigns TF.IDF scores for each proper noun as
follows:

TF.IDF = (ln(TF) + 1) · ln(N+1/n)

where TF is the number of occurrences of a term within a
given document, IDF is the inverse of the number of
documents in which the term occurs, compared to the whole
corpus, N is the total number of documents in the corpus,
and n is the number of documents in which the term occurs.
The product of TF.IDF provides a quantitative indication of
a term's relative uniqueness and importance for matching
purposes. TF.IDF scores are calculated for documents and
queries. The IDF scores are based upon the frequency of
occurrence of terms within a large, representative sample of
documents in each supported language.

The output of the PTI is an index of proper nouns and
expansions with associated TF.IDF scores.

6.2 Probabilistic Query Processor (PQP) 220

PQP 220 accepts the native-language query with
disambiguated concept group assignments for each substantive
word in the query from MCGD 160 and performs the
following processing:

(a) Negation

It is common for queries to simultaneously express both
items of interest and those items that are not of interest. For
example, a query might be phrased "I am interested in A and
B, but not in C." In this instance, A and B are required (they
are in the "positive" portion of the query) and C is negated
and not required (it is in the negative portion of the query).
Only terms in the positive portion of the query are
considered for document matching. The PQP uses the principles of
text structure analysis and models of discourse to identify
the disjunction between positive and negative portions of a
query. The principles employed to identify the positive/
negative disjunction are based on the general observation
among discourse linguists that writers are influenced by the
established schema of the text-type they produce, and not
just on the specific content they wish to convey. This
established schema can be delineated and used to
computationally instantiate discourse-level structures. In the case
of the discourse genre of queries written for online retrieval
systems, empirical evidence has established several
techniques for locating the positive/negative disjunction.
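
As a rough picture of the positive/negative split just described, the sketch below handles only the single English pattern quoted in the example query. The regular expressions and the function name `split_query` are hypothetical; the PQP described here relies on text structure analysis and discourse models rather than one fixed cue.

```python
import re

# Hypothetical cue phrases for the one English pattern used in the example.
LEAD_IN = re.compile(r"^\s*I am interested in\s+", re.IGNORECASE)
NEGATION_CUE = re.compile(r",?\s*but not(?: in)?\s+", re.IGNORECASE)


def split_query(query):
    """Return (positive portion, negative portion) of a simple query."""
    # Drop the trailing period and the introductory phrase.
    body = LEAD_IN.sub("", query.strip().rstrip("."))
    # Split once at the first negation cue, if any.
    parts = NEGATION_CUE.split(body, maxsplit=1)
    positive = parts[0].strip()
    negative = parts[1].strip() if len(parts) > 1 else ""
    return positive, negative
```

Only the first element of the returned pair would be passed on for document matching, since negated terms are not considered.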
(a1) Lexical Clues

For each supported language there exists a class of
frequently used words or phrases that, when connected in a
logical sequence, are used to establish the transition from the
positive to the negative portion of the query (or the reverse).
In English such a sequence might be as simple as "I am
interested in" followed by ", but not." Clue words or phrases
must have a high frequency of occurrence within the
confines of a particular context.

(a2) Component Ordering

Components in a query tend to occur in a certain repetitive
sequence, and this sequence can be used as a clue to
establish negation.

(a3) Continuation Clues

Especially in relatively long queries a useful clue for
negation disjunction detection across sentence boundaries is
conjunctive relations which occur near the beginning of a
sentence and which have been observed in tests to
predictably indicate possible transitions from sentence to sentence.

(b) Construction of Logical Representation of the Query

A tree structure with terms connected by logical operators
is constructed using a native-language sublanguage
processor.

FIG. 7 shows the tree representation of the following
query:

"I am interested in any information about A and B and C,
D or E and F."

The latter portion of the query can be represented as:

A and B and (C or D or (E and F)).

The tree structure includes a head term, which can be a
Boolean AND or OR operator (AND in this case), which
links, possibly through intermediate nodes, to extracted
query terms at terminal nodes (A, B, C, D, E, and F). The
intermediate nodes are also Boolean AND or OR operators.

Various lexical clues are used to determine the logical
form of the query. The basis of this system is a sublanguage
grammar which is based on probabilistic generalizations
regarding the regularities exhibited in a large corpus of
query statements. The sublanguage relies on items such as
function words (the placement of articles, auxiliaries and
prepositions), meta-text phrases, and punctuation (or the
combination of these elements) to recognize and extract the

TABLE 2

Logical Operators Used in Sublanguage Processing

Operator   Operation     Fuzzy Weight/Score
AND        Boolean AND   Addition of scores within AND operator
OR         Boolean OR    Maximum score from all ORed terms
!NOT       Negation      -

Each term in the logical representation is assigned a
weighted score. Scores are normalized such that the
maximum attainable score during matching (if all terms are
successfully matched with a document) is 1.0. During
matching the fuzzy logical AND operator performs an
addition with all matched ANDed term scores. The fuzzy
OR operator selects the highest weighted score from among
all the matched ORed terms. For example, in the query
representation of FIG. 7, if terms A, C and F are matched,
then the score assigned the match would be 0.66 (that is,
0.33 from the match with A, and 0.33 from the match with
C, which is the higher of the ORed C and F weighted scores).

The negation operator (!NOT) divides the query into two
logical portions: the positive portion of the query contains
all positive assertions in the query statement; the negative
portion of the query contains all the negative assertions in
the query. No score is assigned to this operation.

The output of the PQP is a logical representation of the
query requirements with fuzzy Boolean weights assigned to
all terms.

7.0 Matching Documents with Queries

Documents and queries are processed for matching in
their English language form, to take advantage of the
monolingual processing modules of the DR-LINK information
retrieval system [Liddy 94a]; [Liddy 94b]; [Liddy 95].

Documents are arranged in ranked order according to
their relative relevance to the substance of a query. The
matcher uses a variety of evidence sources to determine the
similarity or suitable association between query and
documents. Various representations of document and query are
used for matching, and each document-query pair is
assigned a match score based on (1) the distance between
vectors, and (2) the frequency and occurrence of proper
nouns.

The fact that the documents are represented in a common,
language-independent vector format of weighted slot values,
no matter what the language of the individual documents,
enables the system to treat all documents similarly,

formal logical combination ofreleyancy requirements from Therefore,-it can:-(l) cluster docum 

_ mTqu&ry7The sublanguage interprets the query into pattern- amongst them, and (2) provide a single list of documents 

action rules which reveal the combination of relations that 50 ranked by relevancy, with documents of various languages 

organize a discourse, and which allow the creation from interfiled. Thus the process whereby documents are 

each sentence of a first-order logic assertion, reflecting the retrieved and ranked for review by the user is language 

Boolean assertions in the text independent. 

. , , , . 7.1 Monolingual Category Vector Matcher (MCVM) 200 

Part of this sublanguage is a limited an aphor resolution 55 MCVM 200 ^ similaf iQ the Subject Reld (sfq 

(that is, the recognition of a grammatical substitute, such as matc her described in "Natural Language Processing." 

a pronoun or pro-verb, that refers back to a preceding word The process of document to query matching using the 

or group of words). An example of a simple anaphoric monolingual category vector is: 

reference is shown below: ( a ) Generation of the monolingual category vector for 

"I am interested in the stock market performance of IBM. 60 and document (see earlier discussion and FIGS. 3A 

I am also interested in the company's largest foreign ancl 

shareholders." W Generation of &tance/proximity measures. The vec- 

. , . ' , . . . tor for each text is normalized in order to control for the 

In this example, the phrase "the company s is an anaphoric eflfecl of documenl leDgm . nt vector codes can be consid- 

reference back to "IBM. 65 ered a special form 0 f controlled vocabulary (all words and 

A summary of the fuzzy Boolean operators and their terms are reduced to a finite number of vector codes). A 

function is shown in Table 2, below. similarity measure of the association or correlation of the 



05/25/2004, EAST Version: 1.4.1 



6,01 

19 

query and document vectors is assigned by simulating the 
distance/proximity of the - respective vectors in multi- 
dimensional space using similarity measure algorithms. 
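
By way of illustration only, the normalization and proximity steps of the vector matching can be sketched as follows. The text does not name a particular similarity measure, so cosine similarity is assumed here; the category codes, weights, and function names are hypothetical.

```python
import math


def normalize(vector):
    """Scale a sparse category vector to unit length.

    This controls for document length: a long and a short text with the
    same proportions of category codes end up with the same vector.
    """
    norm = math.sqrt(sum(w * w for w in vector.values()))
    if norm == 0.0:
        return dict(vector)
    return {code: w / norm for code, w in vector.items()}


def cosine_similarity(query_vec, doc_vec):
    """Proximity of two vectors (assumed measure: cosine of their angle)."""
    q = normalize(query_vec)
    d = normalize(doc_vec)
    return sum(w * d.get(code, 0.0) for code, w in q.items())


# Hypothetical category codes and weights, for illustration only.
query = {"FINANCE": 2.0, "TRADE": 1.0}
short_doc = {"FINANCE": 1.0, "TRADE": 0.5}
long_doc = {"FINANCE": 10.0, "TRADE": 5.0}
other_doc = {"SPORTS": 3.0}
```

Under these assumptions, `short_doc` and `long_doc` receive the same similarity to `query` because only the direction of each vector matters after normalization, while `other_doc`, which shares no category code with the query, scores zero.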

7.2 Query to Document Matcher (QDM) and Score Combiner 230

QDM and score combiner 230 accepts three input streams: the TF.IDF
scores for documents from the document index created by PTI 210;
the logical query representation from PQP 220; and the vector
representation of both document and query from the MCVM 200. The
output of the QDM and score combiner module is a score representing
the match between documents and query.

Using the evidence sources listed above, the matcher 
determines the similarity or suitable association between the 
query and the documents. Various representations of docu- 
ment and query are used for matching. Each document- 
query pair is assigned a series of match scores based on (1) 
the common occurrence of proper nouns or expansions in 
the logical query representation, (2) TF.IDF scores, and (3) 
the distance between vectors. 
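
The first of these scores draws on the fuzzy Boolean logical query representation produced by PQP 220. A minimal sketch of that scoring is given below for the query A and B and (C or D or (E and F)). How term weights are distributed is not spelled out in the text; the sketch assumes that an AND node splits its weight equally among its children and that an OR node passes its full weight to each child, which is one scheme that agrees with the worked figures given for FIG. 7. The tuple encoding and the function name are hypothetical.

```python
def score(node, matched, weight=1.0):
    """Fuzzy Boolean score of a query tree against a set of matched terms.

    A node is either a term (str) or a tuple (operator, children).
    AND adds the scores of its children; OR takes the highest child score.
    Weight distribution is an assumption: AND divides its weight equally,
    OR hands its whole weight to every alternative.
    """
    if isinstance(node, str):
        return weight if node in matched else 0.0
    operator, children = node
    if operator == "AND":
        share = weight / len(children)
        return sum(score(child, matched, share) for child in children)
    if operator == "OR":
        return max(score(child, matched, weight) for child in children)
    raise ValueError(f"unknown operator: {operator}")


# A and B and (C or D or (E and F))
QUERY = ("AND", ["A", "B", ("OR", ["C", "D", ("AND", ["E", "F"])])])

partial = score(QUERY, {"A", "C", "F"})
full = score(QUERY, {"A", "B", "C", "D", "E", "F"})
```

With A, C and F matched, the sketch yields two thirds: one third from A plus one third from C, the latter being higher than the one sixth contributed by F alone. This corresponds to the 0.66 quoted for the same case, and a document matching every term reaches the normalized maximum. The !NOT operator is left out because it only separates the positive and negative portions of the query and carries no score.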

Documents are assigned scores using the following evi- 
dence: 

(a) Monolingual Category Vectors. The proximity of the 
vector for query and document. 

(b) Positive TF.IDF (TF.IDF for the positive portion of the
query). Matching is based on a natural-log form of the 
equation TF.IDF, where TF is the number of occurrences of 
a term within a given document, and IDF is the inverse of 
the number of documents in which the term occurs, com- 
pared to the whole corpus (see description of PTI 210). The 
scores are normalized to the highest TF.IDF score for all 
documents. 

(c) Query match. The matching of proper nouns (or other 
terms) and expansions scored from the logical query repre- 
sentation. 
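
For illustration, the positive TF.IDF evidence of item (b) might be computed along the following lines. The exact natural-log form is not given in the text, so the damping `1 + ln(tf)` and the factor `ln(N / df)` are assumptions, as are the function name and the sample documents; only the final normalization to the highest score follows the description directly.

```python
import math
from collections import Counter


def tfidf_scores(query_terms, documents):
    """Score tokenized documents against the positive query terms.

    Returns one score per document, scaled so the best document gets 1.0.
    """
    n_docs = len(documents)
    counts = [Counter(doc) for doc in documents]
    # Number of documents in which each term occurs.
    doc_freq = Counter(term for c in counts for term in c)
    raw = []
    for c in counts:
        total = 0.0
        for term in query_terms:
            if c[term] == 0:
                continue
            tf = 1.0 + math.log(c[term])             # assumed log damping
            idf = math.log(n_docs / doc_freq[term])  # rarer terms weigh more
            total += tf * idf
        raw.append(total)
    top = max(raw, default=0.0)
    return [s / top if top > 0 else 0.0 for s in raw]


# Hypothetical three-document corpus.
docs = [
    ["grain", "price", "price", "market"],
    ["market", "football"],
    ["weather", "report"],
]
scores = tfidf_scores(["price", "market"], docs)
```

In this toy corpus the first document contains both query terms and sets the normalization ceiling, the second matches only the more common term and scores lower, and the third matches nothing.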

7.3 Document Scores 

A logistic regression analysis using a Goodness of Fit 
model is applied to compute a relevance score for each 
document. Three independent variables, corresponding to 
the three types of evidence mentioned above, are used. 

Regression coefficients for each variable in the regression 
equation are calculated using an extensive, representative, 
multilingual test corpus of documents for which relevance 
assignments to a range of queries have been established by 
human judges. 

The logistic probability (logprob) of a given event is 
calculated as follows: 

logprob(event) = 1/(1 + e^{-Z})

where Z is the linear combination

Z = B_0 + B_1 X_1 + B_2 X_2 + B_3 X_3

and B_1 to B_3 are the regression coefficients for the independent
variables X_1 to X_3. Documents are ranked by their logistic
probability values, and output with their scores.
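
As a hedged sketch of this score combination, the three evidence values for a document can be passed through a logistic function of their linear combination. The standard logistic form is assumed; the coefficient values, document names, and evidence values below are invented for illustration and are not taken from any trained model.

```python
import math


def logprob(evidence, coefficients):
    """Logistic probability for one document.

    evidence     -- (X1, X2, X3): vector proximity, positive TF.IDF, query match
    coefficients -- (B0, B1, B2, B3): intercept and regression coefficients
    """
    b0, b1, b2, b3 = coefficients
    x1, x2, x3 = evidence
    z = b0 + b1 * x1 + b2 * x2 + b3 * x3
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients; in practice these would come from regression
# on a corpus with human relevance judgments.
COEFFS = (-3.0, 2.5, 1.5, 2.0)

documents = {
    "doc_a": (0.9, 0.8, 0.66),
    "doc_b": (0.4, 0.1, 0.0),
    "doc_c": (0.7, 0.5, 0.33),
}

# Rank documents by decreasing logistic probability.
ranked = sorted(
    documents,
    key=lambda name: logprob(documents[name], COEFFS),
    reverse=True,
)
```

With these made-up numbers the ranking is `doc_a`, `doc_c`, `doc_b`; a document with all-zero evidence and all-zero coefficients would sit at exactly 0.5.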
8.0 Presentation of Results 
8.1 Recall Predictor 240 

The matching of documents to a query organizes docu- 
ments by matching scores in a ranked list. The total number 
of presented documents can be selected by the user or the 
system can determine a number using the Recall Predictor 
(RP) function. Note that documents from different sources 
are interfiled and ranked in a single list. 

The RP filtering function is accomplished by means of a multiple
regression formula that successfully predicts cut-off criteria for
individual queries based on the similarity of documents to queries
as indicated by the vector matching (and preferably the proper noun
matching) scores. The RP is sensitive to the varied distributions
of similarity scores (or match scores) for different queries, and
is able to present to the user a certain limited percentage of the
upper range of scored documents with a high probability that close
to 100% recall will be achieved. The user is asked for the desired
level of recall (up to 100%), and a confidence interval on the
retrieval. While in some cases a relatively large portion of the
retrieved documents would have to be displayed, in most cases for
100% recall with a 95% confidence interval less than 20% of the
retrieved document collection need be displayed. In trials of the
DR-LINK system (level of recall 100%, confidence level 95%), the
system has collected an average of 97% of all documents judged
relevant for a given query [Liddy 94b].

8.2 Graphical User Interface (GUI) 250 

GUI 250 uses clustering techniques to display conceptually-similar
documents. The GUI also allows users to interact with the system by
invoking relevance feedback, whereby a selection of documents or a
single document can be used as the basis for a reformulated query
to find those documents with conceptually similar contents.
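
A reformulated query of this kind can be pictured as a weighted combination of the original query vector with the vectors of the user-selected documents. The sketch below is one possible combination, not the system's actual formula: the equal split between query and feedback, the function name, and the category codes are all assumptions made for illustration.

```python
def informed_query(query_vec, selected_docs, query_weight=0.5, feedback_weight=0.5):
    """Combine a query vector with user-selected document vectors.

    The feedback weight is shared equally among the selected documents.
    Both weights are hypothetical defaults.
    """
    combined = {code: query_weight * w for code, w in query_vec.items()}
    if selected_docs:
        share = feedback_weight / len(selected_docs)
        for doc in selected_docs:
            for code, w in doc.items():
                combined[code] = combined.get(code, 0.0) + share * w
    return combined


# Hypothetical category vectors.
query = {"FINANCE": 1.0}
picked = [
    {"FINANCE": 1.0, "AGRICULTURE": 1.0},
    {"AGRICULTURE": 1.0},
]
new_query = informed_query(query, picked)
```

Here the new vector keeps the original FINANCE emphasis but also acquires an AGRICULTURE component from the two selected documents, so that re-matching would favor documents covering both subjects.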

The GUI for the CINDOR system is specifically intended to be
suitable for users of any nationality, even if their knowledge of
foreign languages is sparse. Graphic representations of documents
will be used, with textual/descriptive representations kept to a
minimum. Research has shown that the factors that influence
comprehension of new data are (1) the rate at which information is
presented, (2) the complexity of the information, and (3) how
meaningful the new information is. Highly meaningful information is
accepted with relative ease; less meaningful information, in
addition to being less useful, requires greater cognitive effort to
comprehend (and usually reject). Coherence of presentation and an
association with existing knowledge are both highly correlated with
increased meaningfulness. Thus the concept behind the user
interface is to present "details on demand," showing only enough
information to allow quick apprehension of relevance: more details
are immediately available through hypertext links.

8.3 Document Clustering, Browsing and Relevance Feedback

The monolingual category vectors are used as the basis for the
clustering and display, and for the implementation of relevance
feedback in the system:
8.3.1 Clustering 

Documents can be clustered using an agglomerative (hierarchical)
algorithm that compares all document vectors and creates clusters
of documents with similarly weighted vectors. The nearest
neighbor/Ward's approach is used to determine clusters, thus not
forcing uniform sized clusters, and allowing new clusters to emerge
when documents reflecting new subject areas are added. These
agglomerative techniques, or divisive techniques, are appropriate
because they do not require the imposition of a fixed number of
clusters.

Using the clustering algorithm described above, or other algorithms
such as single link or nearest neighbor, CINDOR is capable of
mining large data sets and extracting highly relevant documents
arranged as conceptually-related clusters in which documents from
several languages co-occur.

Headlines from newspaper articles or titles from documents in the
cluster are used to form labels for clusters. Headlines or titles
are selected from documents that are near the centroid of a
particular cluster, and are therefore highly representative of the
cluster's document contents. An alternative labeling scheme,
selectable by the user, is the use of the labeled subject codes
which make up either the centroid document's vector or the cluster
vector.

The user is able to browse the documents, freely moving from
cluster to cluster with the ability to view the full documents in
addition to their summary representation. The user is able to
indicate those documents deemed most relevant by highlighting
document titles or summaries. If the user so decides, the relevance
feedback steps can be implemented and an "informed" query can be
produced, as discussed below.

The CINDOR system is thus able to display a series of
conceptually-related clusters in response to a browsing query. Each
cluster, or a series of clusters, could be used as a point of
departure for further browsing. Documents indicative of a cluster's
thematic and conceptual content would be used to generate future
queries, thereby incorporating relevance feedback into the browsing
process. The facility for browsing smaller, semantically similar
sub-collections which contain documents of multiple languages aids
users in determining which documents they might choose to have
translated.

8.3.2 Developing "Informed" Queries for Relevance Feedback

Relevance feedback is accomplished by combining the vectors of
user-selected documents or document clusters with the original
query vector to produce a new, "informed" query vector. The
"informed" query vector will be matched against all document
vectors in the corpus or those that have already passed the cut-off
filter. Relevant documents will be re-ranked and re-clustered.

1. Combining of Vectors. The vector for the original query and all
user-selected documents are weighted and combined to form a new,
single vector for re-ranking and re-clustering.

2. Re-Matching and Ranking of Corpus Documents with New, "Informed"
Query Vector. Using the same similarity measures described above
for MCVM 200, the "informed" query vector is compared to the set of
vectors of all documents above the cut-off criterion produced by
the initial query (or for the whole corpus, as desired), then a
revised query-to-document concept similarity score is produced for
each document. These similarity scores are the system's revised
estimation of a document's predicted relevance. The set of
documents are thus re-ranked in order of decreasing similarity of
each document's revised predicted relevance to the "informed" query
on the basis of revised similarity value.

3. Cut-Off and Clustering after Relevance Feedback. Using the same
regression formula described above in connection with recall
predictor 240, a revised similarity score cut-off criterion is
determined by the system on the basis of the "informed" query. The
regression criteria are the same as for the original query, except
that only the vector similarity score is considered. The
agglomerative (hierarchical) clustering algorithm is applied to the
vectors of the documents above the revised cut-off criterion and a
re-clustering of the documents will be performed. Given the
re-application of the cut-off criterion, the number of document
vectors being clustered will be reduced, and improved clustering is
achieved.

8.4 Application of "Gloss" Transliteration to Highly Relevant
Documents

Conceptual-level matching and disambiguation of words ensures that
when these words are translated, the correct sense or meaning will
be selected. It is therefore possible to offer a surface-level
transliteration of highly relevant documents with a very high
degree of certainty that the correct translation of words will be
performed.

An example of the transliteration system output is shown below:

French Original Text:     Les Surplus et les chutes des prix agricole entrainent
CINDOR Transliteration:   rise fall price agricultural bring about
English Translation:      The rise and fall of agricultural prices drives

French Original Text:     des mouvements sur les marches. La faute a qui? . . .
CINDOR Transliteration:   movements markets, fault who? . . .
English Translation:      movements in the markets, whose fault is it? . . .

Only some of the words will be mapped into corresponding,
disambiguated words or phrases in another language. Much of the
text in a document, especially the functional classes of words,
will remain un-transliterated. One of the advantages of this
approach is that the laborious and expensive process of translating
a great many foreign documents to ascertain relevance can be
avoided. With CINDOR, only those few documents that obtain a high
relevance ranking and show promise in their transliterated form
become candidates for full translation, if desired. The mapping of
words could be based on (1) whether they have been indexed in the
MCD, (2) their POS-tag assignment, (3) anaphoric disambiguation,
and (4) meta-textual and discourse-level considerations, such as
whether words and phrases are in the headline of a text.

8.5 Machine Translation of Relevant Documents

Documents or document clusters that, based on their high relevance
ranking, the gloss transliteration, or other factors, are deemed to
be highly relevant to a query are candidates for a machine
translation of the original foreign language text. CINDOR thus
ensures that only those few documents that are especially pertinent
to a query will undergo the full translation process.

CINDOR incorporates a range of computer aided translation modules,
each a COTS technology, that translate a given document from one
language to another. The selection of the appropriate COTS module
is automatic, being based on the language identification assignment
for each document provided by LI 120 and on the identified language
of the query. For any given query and range of documents it is
likely that multiple translation modules will be activated. Each
machine translation COTS module, or MT engine, will process source
documents to create a given translation without human intervention
or aid. In cases where the document contains arcane or
industry-specific terminology, such as with medical or legal
documents, multilingual mapping terminology managers with objects
stored in a conceptual orientation may also be invoked to aid the
translation process.

9.0 References

[Liddy94a] Liddy, E. D. & Myaeng, S. H. (1994). DR-LINK System:
Phase I Summary. Proceedings of the TIPSTER Phase I Final Report.

[Liddy94b] Liddy, E. D., Paik, W., Yu, E. S. & McKenna, M. (1994).
Document retrieval using linguistic knowledge. Proceedings of RIAO
'94 Conference.

[Liddy95] Liddy, E. D., Paik, W., McKenna, M. & Yu, E. S. (1995). A
natural language text retrieval system with relevance feedback.
Proceedings of the 16th National Online Meeting.

10.0 Conclusion

In conclusion, it can be seen that the present invention provides
an elegant and efficient tool for multilingual document retrieval.
The system permits even those searchers with limited or no
knowledge of foreign languages to gather highly relevant
information from international sources. Since the system offers a
"gloss" transliteration of target texts, the user is able to
ascertain relevance of foreign-language texts so as to be able to
make an intelligent decision regarding full translation.

While the above is a complete description of specific 
embodiments of the invention, various modifications, alter- 
native constructions, and equivalents may be used. For 
example, while the specific embodiment augments concept 
level matching through the use of term-based representa- 
tions and matching, it is possible to implement an embodi- 
ment using concept level matching alone. Additionally, 
evidence combination criteria could be modified for differ- 
ent retrieval criteria. For example, some specific terms or 
some specific concept categories may be considered man- 
datory for matching, such that matching would be a two-step 
process of foldering based on logical requirements, and 
within folders regression-based matching scores would be 
used. 

Similarly, while the described disambiguation method is 
the presently preferred method, there are other possibilities, 
such as statistical or entirely probabilistic techniques. 
Indeed, disambiguation of concept codes, while preferred, is 
not essential. Moreover, the concept vector categories, 
codes, and hierarchy could be modified or expanded, as 
could the proper noun categories, codes, and hierarchy. 

Another language-independent method of representing 
text is using n-gram coding, wherein a text is decomposed 
to a sequence of character strings, where each string contains 
n adjacent characters from the text. This can be done by 
moving an n-character window n characters at a time, or by 
moving the n-character window one character at a time. In 
an n-gram representation, no attempt is made to understand, 
interpret or otherwise catalog the meaning of the text, or the 
words that make up the text. A tri-gram representation is the 
special case where n=3. Representation and matching are
based on the co-occurrence of n-grams or a sequence of 
character strings, or on the co-occurrence and relative preva- 
lence of such n-grams, or on other, similar schemes. Such 
analysis is an alternative representational scheme for CIN- 
DOR. 
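
The two ways of moving the n-character window can be illustrated with a short sketch. The function name and the `step` parameter are hypothetical; a step of one slides the window a character at a time, and a step of n moves it n characters at a time.

```python
def ngrams(text, n=3, step=1):
    """Decompose text into strings of n adjacent characters.

    step=1 moves the window one character at a time (overlapping strings);
    step=n moves it n characters at a time (non-overlapping strings).
    A trailing fragment shorter than n is dropped in this sketch.
    """
    if n <= 0 or step <= 0:
        raise ValueError("n and step must be positive")
    return [text[i:i + n] for i in range(0, len(text) - n + 1, step)]


overlapping = ngrams("retrieval", n=3, step=1)
disjoint = ngrams("retrieval", n=3, step=3)
```

For the word "retrieval", the overlapping tri-grams are ret, etr, tri, rie, iev, eva, val, while the non-overlapping window yields ret, rie, val. No dictionary or language knowledge is involved in either case.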

In this alternative embodiment, an n-gram query processor (NQP)
module replaces probabilistic query processor (PQP) 220, an n-gram
document processor (NDP) replaces probabilistic term indexer (PTI)
210, and an n-gram query to document matcher (NQDM) replaces query
to document matcher (QDM) 230. The NQP accepts the native-language
input and performs the following processing: a) decomposes each
term in the queries into n-adjacent-character strings; and b) lists
each unique n-adjacent-character string with the number of
occurrences as the query representation. The NDP accepts the output
from PNC 140 and performs the following processing: a) decomposes
each term in the document into n-adjacent-character strings; and b)
lists each unique n-adjacent-character string with the number of
occurrences as the document representation. The NQDM accepts two
input streams, namely the outputs from the NQP and NDP, and
provides a score representing the match between the documents and
query. This output is an input to the score combiner. Documents are
assigned scores by measuring the degrees of overlap between the
n-gram decomposed terms from documents and queries. The larger the
overlap, the higher the degree of relevance.
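
A minimal sketch of this n-gram pipeline follows: each term is decomposed, the unique strings are listed with their counts, and the match score is the degree of overlap. The text does not define how overlap is measured, so the fraction of query n-gram occurrences also found in the document is assumed here; the function names and sample terms are hypothetical.

```python
from collections import Counter


def ngram_profile(terms, n=3):
    """List each unique n-character string in the terms with its count."""
    profile = Counter()
    for term in terms:
        for i in range(len(term) - n + 1):
            profile[term[i:i + n]] += 1
    return profile


def overlap_score(query_profile, doc_profile):
    """Share of the query's n-gram occurrences that the document also has.

    This particular overlap measure is an assumption for illustration.
    """
    total = sum(query_profile.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count of each shared string.
    shared = sum((query_profile & doc_profile).values())
    return shared / total


query = ngram_profile(["prices"])
doc_en = ngram_profile(["agricultural", "prices", "fall"])
doc_fr = ngram_profile(["les", "prix", "agricoles"])
```

The English document contains every tri-gram of the query term and scores 1.0, while the French document shares only "pri" and "ric" and scores 0.5. The example also shows why such matching needs no translation step, although shared character strings are only a rough signal of shared meaning.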

Therefore, the above description should not be taken as
limiting the scope of the invention as defined by the claims.

What is claimed is:

1. A method of representing documents in a database that
includes documents in a plurality of languages, the method
carried out for each document, comprising:
determining a set of potential conceptual-level meanings
of at least some words in the document from a
language-independent multilingual concept database
that reflects the plurality of languages and comprises a
collection of concept groups;
mapping the sets of potential conceptual-level meanings,
so determined, to respective single language-independent
conceptual-level meanings; and
generating a language-independent conceptual representation
of the subject content of the document based on
the language-independent conceptual-level meanings
determined in said mapping step.

2. The method of claim 1, and further comprising, deter- 
mining the language of the document for at least some 
documents. 

3. The method of claim 1, and further comprising

generating a term-based representation of the document 
for at least some documents. 

4. The method of claim 3 wherein the term-based representation
of the document is a representation of a set of
proper nouns found in the document.

5. The method of claim 4 wherein the set of proper nouns 
found in the document are represented as categories from a 
hierarchical classification scheme. 

6. The method of claim 3 wherein the term-based representation
of the document is a representation of a set of noun
phrases found in the document.

7. The method of claim 1 wherein a given one of said
words is polysemous, giving rise to multiple conceptual-level
meanings from the multilingual concept database, and
said mapping, for the given word, comprises:

disambiguating among the multiple conceptual-level
meanings from the multilingual concept database to
provide a single multilingual conceptual-level meaning;

mapping the single multilingual conceptual-level meaning
to a set of monolingual concept categories in a
monolingual concept dictionary; and

if the set of monolingual concept categories from the
monolingual concept dictionary contains multiple
monolingual concept categories, disambiguating
among the multiple monolingual concept categories to
provide the single, language-independent conceptual-level
meaning.

8. The method of claim 7 wherein disambiguating
includes:

analyzing local context information to attempt to determine
a single meaning;

if a single meaning is not determined from analyzing local
context information, analyzing domain knowledge to
attempt to determine a single meaning; and

if a single meaning is not determined from analyzing
domain knowledge, analyzing global information to
attempt to determine a single meaning.

9. The method of claim 7, and further comprising

providing a gloss transliteration using the single multi- 
lingual conceptual-level meaning derived from disam- 
biguating among the multiple conceptual-level mean- 
ings. 

10. The method of claim 1 wherein the collection of
concept groups includes words or phrases, from the plurality
of languages, that are conceptually synonymous.

11. The method of claim 1, and further comprising:
determining a measure of proximity of the language- 
independent conceptual representation of each docu- 
ment to the language-independent conceptual represen- 
tation of the other documents in the plurality; and 

clustering the documents in the plurality according to the 
documents' respective measures of proximity to each 
other. 

12. A method of retrieving documents in response to a 
query, the query being in a user-selected language of a 
plurality of languages, the method comprising: 

providing a corpus of documents, each in a language of 
said plurality of languages, at least one of the docu- 
ments being in a language other than the user-selected 
language; 
for each document; 
determining a set of multilingual concepts of at least 
some words in the document using a language- 
independent multilingual concept database that 
reflects the plurality of languages and comprises a 
collection of concept groups; 
mapping the sets of multilingual concepts, so 
determined, to respective single language- 
independent conceptual-level meanings; and 
generating a language-independent conceptual repre- 
sentation of the subject content of the document 
based on the language-independent conceptual-level 
meanings determined in said mapping; 
generating a language-independent conceptual represen- 
tation of the subject content of the query; and 
for each document, generating a measure of relevance of 
the document to the query using the conceptual repre- 
sentation of the subject content of the document and the 
conceptual representation of the subject content of the 
query. 

13. The method of claim 12 wherein the query is a natural 
language query. 

14. The method of claim 12 wherein generating a 
language-independent conceptual representation of the sub- 
ject content of the document comprises: 

generating a conceptual-level vector representing the sub- 
ject content of the document. 

15. The method of claim 12 wherein:

the concept groups include a collection of synonyms and
near-synonyms of a given word or phrase in said
plurality of languages;

a given word in the document is polysemous, giving rise
to multiple conceptual-level meanings from the multilingual
concept database; and

for the given word, mapping the sets of multilingual
concepts to respective single language-independent
conceptual-level meanings comprises:
disambiguating the multiple conceptual-level meanings.

16. The method of claim 15 wherein disambiguating the 
multiple conceptual-level meanings comprises: 

disambiguating among the multiple conceptual-level 
meanings from the multilingual concept database to 
provide a single multilingual conceptual-level mean- 
ing; 

mapping the single multilingual conceptual-level mean- 
ing to a set of monolingual concept categories in a 
monolingual concept dictionary; and 

if the set of monolingual concept categories from the
monolingual concept dictionary contains multiple
monolingual concept categories, disambiguating
among the multiple monolingual concept categories to
provide the single language-independent conceptual-level
meaning.

17. The method of claim 16 wherein disambiguating
includes:

analyzing local context information to attempt to determine
a single meaning;

if a single meaning is not determined from analyzing local
context information, analyzing domain knowledge to
attempt to determine a single meaning; and

if a single meaning is not determined from analyzing
domain knowledge, analyzing global information to
attempt to determine a single meaning.

18. The method of claim 16, and further comprising

providing a gloss transliteration using the single multi- 
lingual conceptual-level meaning derived in disam- 
biguating among the multiple conceptual-level mean- 
ings. 

19. The method of claim 12, and further comprising 
providing a gloss transliteration of at least some of the words 
in at least one of the documents. 

20. The method of claim 12 wherein said language- 
independent conceptual representation of the subject content 
of the document is augmented by a language-dependent 

statistical index using words in the document's language as
indexing units. 

21. The method of claim 12 wherein said language- 
independent conceptual representation of the subject content 
of the document includes a statistical index using N-gram 

style decomposed words as indexing units.

22. The method of claim 12 wherein generating a 
language-independent conceptual representation of the sub- 
ject content of the query comprises: 

mapping words or phrases in the query into
language-independent concepts; and

generating a conceptual-level vector representing the
subject content of the query.

23. The method of claim 12 wherein said language- 
independent conceptual representation of the subject content 

of the query includes N-gram style decomposed terms as
language-independent query requirements. 

24. The method of claim 12 wherein said language- 
independent conceptual representation of the subject content 
of the query is augmented using a language-dependent logic 

requirement, the logic requirement including terms and
logical connectives where the terms include the query term
and its synonymous terms in the multilingual concept database.

25. The method of claim 12, and further comprising:
providing a list of at least some of the documents;

receiving user input specifying at least one document, or
a part thereof, on the list; and

generating a revised query representation based on the
original query plus a representation of the specified
document or documents, or parts thereof.

26. The method of claim 12, and further providing a 
relevance-ranked list of at least some of the documents. 

27. The method of claim 26, wherein the number of 
documents in the relevance-ranked list of documents is 

calculated based on a user-specified level of recall.

28. The method of claim 26, wherein the number of 
documents in the relevance-ranked list of documents is 
calculated based on a user-specified level of recall and a 
user-specified level of confidence in that level of recall. 

29. The method of claim 26, wherein retrieved documents
in the relevance-ranked list of documents are ranked without
regard to the language they are written in.

30. The method of claim 12, and further determining the
language of the document before generating a
language-independent representation of the document.

31. The method of claim 12 wherein generating a measure 
of relevance for a given document comprises: 

generating conceptual-level vectors for the given docu- 
ment and for the query; and 

determining a distance between the vectors, the distance 
representing the measure of relevance, with a smaller 
distance representing a higher degree of relevance. 

32. The method of claim 12 wherein generating a measure 
of relevance for a given document comprises: 

generating an N-gram decomposed term representation 
for the given document and for the query; and 

determining a degree of overlap between the N-gram 
decomposed terms, the overlap representing the mea- 
sure of relevance, with a larger overlap representing a 
higher degree of relevance. 

33. The method of claim 12 wherein generating a measure 
of relevance for a given document comprises: 

generating word representations for the given document
and for the query;

organizing words in the query as logical requirements;
and

determining a coverage of terms in the documents against
the logical requirement of a query, the coverage representing
the measure of relevance, with a larger coverage
representing a higher degree of relevance.

34. A method of retrieving documents in response to a 
query in a user-selected language of a plurality of languages, 
the method comprising: 

(a) providing a corpus of documents, each in a language
of said plurality of languages, at least some of the
documents being in a language other than the user-selected
language;

(b) processing each document by
determining the language of the document,
determining conceptual-level meaning of at least some
words in the document from a language-independent
multilingual concept database comprising a collection
of concept groups,
mapping the conceptual-level meanings into language-independent
concepts, and
generating a conceptual-level vector representing the
subject content of the document;

(c) processing the query by
mapping words or phrases in the query into language-independent
concepts, and
generating a conceptual-level vector representing the
subject content of the query; and

(d) for each document, determining a measure of relevance
to the query.

35. The method of claim 34 wherein mapping the
conceptual-level meanings into language-independent concepts
comprises

disambiguating multiple senses of polysemous words and
phrases to generate the language-independent concepts.
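
The three relevance measures recited in claims 31 to 33 (vector distance, N-gram overlap, and coverage of a logical requirement) can be illustrated with a minimal Python sketch. The claims do not fix a distance metric, an N-gram length, or a requirement format, so the Euclidean distance, the character trigrams, the AND-of-OR-groups requirement, and all function names and concept labels below are assumptions made only for illustration.

```python
import math


def vector_distance(doc_vec, query_vec):
    """Distance between two concept vectors (dicts: concept -> weight).

    Euclidean distance is an assumption; a smaller value means higher relevance.
    """
    concepts = set(doc_vec) | set(query_vec)
    return math.sqrt(
        sum((doc_vec.get(c, 0.0) - query_vec.get(c, 0.0)) ** 2 for c in concepts)
    )


def ngrams(text, n=3):
    """Decompose text into the set of its character n-grams."""
    s = text.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_overlap(doc_text, query_text, n=3):
    """Fraction of the query's n-grams found in the document (larger = more relevant)."""
    q = ngrams(query_text, n)
    if not q:
        return 0.0
    return len(q & ngrams(doc_text, n)) / len(q)


def coverage(doc_words, requirement):
    """Fraction of requirement groups satisfied by the document's words.

    requirement is a list of groups; each group is a list of alternative
    terms (a term and its synonyms), and a group is satisfied when any
    one of its terms occurs in the document.
    """
    words = {w.lower() for w in doc_words}
    if not requirement:
        return 0.0
    satisfied = sum(1 for group in requirement if words & {t.lower() for t in group})
    return satisfied / len(requirement)
```

For instance, with a hypothetical query vector `{"VEHICLE": 1.0, "CHILD": 1.0}`, a document with the same vector has distance 0 and ranks ahead of a document whose only concept is `"FINANCE"`.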

* * * * * 

ABSTRACT



An integrated method for searching and reporting the search
of electronic data files by receiving a plurality of first and
second search concepts from the user, forming the first and
second concepts into a two-dimensional matrix of paired
concepts, performing a search of one or more databases
based on all concepts and paired concepts in the matrix, and
identifying and displaying a corresponding matrix of search
results. An integrated search collection provides formatted
documents for drag and drop collection of search information
and construction of a search library. An integrated report
generation utilizes the format of the collection document for
automatic construction of a report.

METHOD AND APPARATUS FOR
ELECTRONIC FILE SEARCH AND 
COLLECTION 

U.S. Provisional Application Ser. No. 60/210,482, filed
Jun. 9, 2000, is hereby incorporated by reference.

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

The present invention is directed to a method for search- 
ing electronic data files and, more particularly, to a method 
including the entering of a two-dimensional array of search 
concepts, each concept being predefined key words and 
expressions or user-defined key words and expressions, and 
detecting and displaying a correlation of occurrence, within 
the electronic data files, between entered concepts in the 
respective dimensions. 
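
The idea of a two-dimensional array of search concepts, with a correlation of occurrence detected for each pair, can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described in this specification: each concept is assumed to be a named list of key words, a document is assumed to match a concept when any of its key words appears as a whole word, and the correlation is assumed to be a simple count of documents matching both concepts. The names `matches` and `matrix_search` are hypothetical.

```python
from itertools import product


def matches(key_words, text):
    """True when any key word of the concept occurs as a whole word in the text."""
    words = set(text.lower().split())
    return any(k.lower() in words for k in key_words)


def matrix_search(first, second, documents):
    """Count, for every (row concept, column concept) pair, the documents matching both.

    first, second: dict mapping a concept name to its list of key words.
    documents: dict mapping a document id to its text.
    Returns a dict keyed by (row name, column name).
    """
    matrix = {}
    for (row, row_keys), (col, col_keys) in product(first.items(), second.items()):
        matrix[(row, col)] = sum(
            1
            for text in documents.values()
            if matches(row_keys, text) and matches(col_keys, text)
        )
    return matrix
```

Every pair of concepts is searched, so the resulting matrix shows at a glance which combinations occur together and which do not, rather than relying on the user to try each combination by hand.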

2. Related Art 

The amount of information generated, collected, stored,
communicated and accessible through the electronic media
is continuing to increase. The increase is not only in the
volume; it is in the number of sources, and the variety of
formats in which the information is communicated and
stored. The sources include newspapers, technical journals,
government publications, literary works, laws, court
opinions, business reports, and public records. More and
more of these are being generated, stored, searched,
retrieved, and distributed through networked systems of
digital computers and other digital document generation and
management devices. The migration of these and other
sources, and large archives of the same, to electronic media
is generally attributed to a combination of the Internet and
the increasing number of and capabilities of personal computers
(PCs) and other Internet access devices.

The average operator-user with an entry-level PC, a
telephone line, and a subscription to an Internet Service
Provider (ISP), such as America On Line®, now has access
to literally billions of documents, forms, images, and text
files, stored throughout the world on a myriad of databases.
A large number of the databases are available as free access,
to anyone, while others are subscription based or otherwise
limited access. There are large databases which, although
not directly accessible through the World Wide Web, are
available through controlled-access wide area networks
(WANs). As known to persons skilled in the relevant art,
these may be physically separate from the Internet or may be
Virtual Private Networks (VPNs) which coexist on the
Internet with public data traffic. Through such private networks
an authorized person may have access to large proprietary
databases of technical journals,
medical records, criminal records, internal memoranda,
business reports and the like.

There are continuing problems, though, with searching
such a large number of electronic files. Many of these
problems prevent users from fully exploiting the Internet,
and other wide area networks, and the many databases which
these networks make available for their use. One of the
problems is the formulation of a search strategy. Search
strategy includes the choice of particular features that the
user believes, or has otherwise determined, would be contained
in, described by, or descriptive of the electronic files
relating to the topic that he or she is researching. The
choosing of these search features is critical to the research
task, yet in most cases it is carried out using nothing more
than intuition, trial and error.

Stated more particularly, a typical search of the World 
Wide Web is as follows: A user accesses the Internet 
through, for example, an Internet Service Provider such as
America On Line®. The user then, using computer software 
features that are well known in the art, enables a web 
browser program that resides on his or her personal 
computer, such as, for example, Microsoft Explorer® or 
Netscape Navigator®. As is well known in the art, the web 
browser is usually programmed with a default "home page", 
which is the Universal Resource Locator ("URL") of a 
specific web site. The web browser then performs the 
required Hypertext Transfer Protocol ("HTTP") communi- 
cations with the web server hosting the home page. 

The home page may be hosted by a commercial web 
services/advertising entity, such as Microsoft Network®, 
Excite®, and Yahoo®. Such commercial home pages gen- 
erally have one or more icons representing search engines, 
both their own and those of third parties such as Lycos® and 
Infobot®. When the user clicks on the search engine, he or 
she is presented with a display page typically having a field 
for entering the search query terms, also referenced in the art 
as "key words". 

The typical user then proceeds to enter the key words. 
Many commercially available Internet search engines pro- 
vide Boolean connectors of AND, OR and NOT for con- 
necting the key words. Boolean searching ideally identifies 
all documents containing the defined connection of string of 
"key words". This may be with or without further 
limitations, such as year, language, publisher, and other type 
characteristics. Some of the sophisticated Boolean search 
methods permit the user to define search terms to include not 
only the term itself, but also the synonyms of, and the ranges 
around the term. There are available search engines that 
have the ability to group key words according to parenthesis. 
This permits more complex Boolean expressions. 
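
The Boolean connectors and parenthesis grouping described above can be illustrated with a short Python sketch. It is not taken from any particular search engine: the nested-tuple representation of an expression, the function names `evaluate` and `search`, and the use of case-insensitive substring matching are all assumptions chosen to keep the example small.

```python
def evaluate(expr, text):
    """Evaluate a Boolean expression against one document's text.

    expr is either a key word or phrase (str), or a tuple whose first
    element is the connector: ("AND", a, b, ...), ("OR", a, b, ...), ("NOT", a).
    Nesting tuples plays the role of parentheses.
    """
    if isinstance(expr, str):
        # Simplification: substring match, so "bus" would also match "business".
        return expr.lower() in text.lower()
    op, *args = expr
    if op == "AND":
        return all(evaluate(a, text) for a in args)
    if op == "OR":
        return any(evaluate(a, text) for a in args)
    if op == "NOT":
        return not evaluate(args[0], text)
    raise ValueError(f"unknown connector: {op}")


def search(expr, documents):
    """Return the ids of all documents (dict: id -> text) that satisfy expr."""
    return [doc_id for doc_id, text in documents.items() if evaluate(expr, text)]
```

An expression such as `("AND", ("OR", "child", "kids"), ("OR", "bus", "public transportation"))` then returns only the documents that mention at least one term from each group, and wrapping it in a further `("AND", ..., ("NOT", "school"))` narrows the result.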

The entry field, though, forms the key words into a 
one-line expression, regardless of the number of terms. 
Therefore, in that one line expression, the user is attempting 
to formulate a single Boolean expression that will, based 
only on his or her intuitive sense, have a "feels OK" 
likelihood of finding relevant files, i.e., "hits", but is not so 
broad that it retrieves an unwieldy number. 

In a typical scenario of Boolean searching, however, the 
user would not simply formulate a single expression, and 
then conduct the entire search using only that expression. 
Instead, the process is typically as follows: The user 
attempts a first Boolean expression and gets a number of 
"hits". If the number of hits is zero the user will usually vary 
the expression, either by removing one of the AND operators 
and thus lowering the criteria required for a document to
qualify as a hit, or by substituting a synonym for one or more
of the search terms. If the number is too high the user may 
retrieve, by one of the known methods, a sample set of the 
"hits" and read them to identify his or her next strategy. Most 
often the user will simply add further search criteria, typi- 
cally by connecting another key word to the original Bool- 
ean phrase by an AND operator, and then run another search. 
When the process is completed, which is frequently coinci- 
dent with the point where the user runs out of time, the 
typical user will have attempted a generally random 
sequence of different Boolean expressions, and many varia- 
tions on each. The user has, hopefully at least, laboriously 
retrieved and reviewed documents obtained from each 
search expression and, in a method that is typically unique 
to each user, has collected and combined these into, for 
example, a research report. 

There are numerous problems with this method. One 
major problem is that the user is attempting to find an 
optimal search phrase, using the number of "hits" resulting
from each attempt compared to the previous attempt as the
sole heuristic. For example, assume that a user is writing a
paper on trends in the number of children who are transported
to and from school by busses as compared to the
number who are transported by parents or guardians.
Assume that the first Boolean phrase that the person uses is
the previous example of (CHILD OR KIDS) AND (BUS OR
("PUBLIC TRANSPORTATION")). Assume that the user is
searching the Internet, using known methods of Internet
access. If the number of hits is too high the user will add
another search term. An example would be PERCENTAGE
TRANSPORTED. The typical user would then run the
search again and see the number of hits. After a number of
iterations the user would finally obtain an acceptable number
of hits, for example thirty.

The search "methodology" described above has other
shortcomings. One is that the user might not record the
various search Boolean phrases that were attempted before
he or she finds the phrase that yields the desired thirty hits.
As a result the user might run the same search twice, or
might forget to try all possible substitutions of terms.
Another problem, which is more fundamental, is that the
search phrase that the user ended up with might not be the
only search phrase that obtains thirty hits, and, of those
phrases, it might not be the best one.

Still another problem, which overlays all of the previously
identified problems, is that some users are better than others
at formulating search expressions. This creates a statistical
variance in the "quality" of searches, both in terms of time
and coverage, which may itself be a problem, especially
within certain institutions and professions.

Another problem with a "methodology" for Boolean
searching such as the example above is that the user may not
have fully defined or developed the topic of the paper before
starting the research. As is well known among, for example,
college students, the user frequently starts the search before
fully identifying the topic, scope, or conclusion of the task
for which the search is being conducted. The user then picks
the topic, and composes the outline of the paper, or other
reporting document, after sifting through the results
obtained from his or her repeated searches with different
Boolean expressions. However, in using the "trial and error"
method of attempting numerous Boolean expressions to see
which one provides results that inspire the user, the user may
frequently overlook many Boolean expressions for which
the search results would reveal more interesting or valuable
topics.

Yet another problem with the prior art of searching using
single-line Boolean expressions is that many users cannot
easily generate or store an understandable description of, or
history of, the overall search strategies that were employed
when he or she conducted a search. Therefore, frequently the
user will run what is basically the same search twice, or will
recreate the search strategy each time a particular project is
picked up again or a new research project is undertaken.

Still another problem is that after trying multiple Boolean
phrases and obtaining and relying on the results obtained
with one or more of the searches, the user may have
difficulty ascertaining or defending the quality of the search.
This is the problem that may be encountered by students, as
well as consultants and analysts when having to defend the
facts, analysis or conclusion presented in a final paper based
on research results.

SUMMARY OF THE INVENTION

The present invention provides a structured, concept-exhaustive
method for searching databases for documents
and other electronic files by receiving a plurality of search
concepts from the user, designating a first plurality of the
search concepts as a first search vector defining a first
dimension of the matrix, and designating a second plurality
of the search concepts as a second search vector defining a
second dimension of the matrix. The method then performs
a search of one or more databases based on the matrix, and
identifies a plurality of search results, each represented by a
cell of the matrix. A row of the matrix is formed by a row
of cells reflecting, on a one to one basis, a search result for
each of the plurality of search concepts within the first
search vector. A column of the matrix is formed by a column
of cells reflecting, on a one to one basis, a search result for
each of the plurality of search concepts within the second
search vector. Other cells of the matrix reflect, on a one to
one basis, a search result for each unique pair comprising a
search concept from among said first plurality of search
concepts and a search concept from among said second
plurality of search concepts.

A further embodiment of the invention presents the user
with a visual [illegible] arranging the first plurality of search
concepts as a [illegible] and the second plurality of
search concepts as a [illegible] row. Each cell within the border
is in a row-column position corresponding to a pair of search
concepts, one being from the first plurality of search concepts
and one being from the second plurality of search
concepts. [illegible] results
each have a visual state reflecting the search result for
the search concept or pair of search concepts corresponding
to that cell.

A still further embodiment of the invention includes a step
of displaying the matrix of cells to appear as a two-dimensional
plane, and displaying the search results to
appear as a third dimension.

A further embodiment of the invention includes a step of
receiving a search concept definition command from a user,
and defining one or more of the plurality of search concepts
in accordance with the received search concept definition
command.

Another embodiment of the invention may be combined
with any of the previously identified embodiments, and
comprises the further step of receiving a user-entered cell
selection command, presenting the user with a cell result list
identifying documents and other electronic files within the
search results reflected by the selected cell. This embodiment
optionally includes a further step of receiving a document
selection command from the user and a step of
displaying information reflecting information content of a
document or other electronic file selected in accordance with
the document selection command. This embodiment optionally
includes a feature of simultaneously displaying
the received cell selection command, the cell result list, a
data reflecting the document selection command, and the
information reflecting information content.

A further embodiment of the invention may be combined
with any of the previously defined embodiments of matrix
searching in accordance with the present invention, and
includes the further steps of receiving a collection document
command from the user, and generating a collection document
in response, receiving a document selection command
from the user, displaying a document or other electronic file
in response, receiving a portion storage command, and
copying of information into the collection document from a
portion of the displayed document corresponding to the
portion storage command.

A still further embodiment of the invention includes an
organizing step which may be combined with any of the
previously defined matrix searching with collection
embodiments, and includes the further steps of receiving a
user-entered document tag data, and storing an information
into the collection document corresponding to the received
document tag data and a portion of the displayed document
corresponding to the portion storage command. An optional
feature of this embodiment includes a user-entered relational
database information data with the document tag data. A
further optional feature of this embodiment includes steps of
receiving a collection document store command from the
user, and storing the collection document into a collection
database in response, and repeating the step of matrix
searching to including searching the collection database.

A further embodiment of the invention includes a reporting
step which may be combined with any of the previously
defined matrix searching with collection and organizing
embodiments, and includes the further steps of receiving a
user-entered link analysis generation command, identifying
information contained in the search result documents that is
common between two or more search concepts, and generating
a link document having a link information reflecting
the information identified as common. An optional feature of
this embodiment includes a step of generating a graphical
link chart showing the link information.

A still further embodiment of this invention comprises
any of the previously defined embodiments combined with
a step of drill down matrix searching, the drill down matrix
searching comprising the step of receiving a cell search
command from the user, receiving a new plurality of search
concepts from the user, the receiving including entering or
designating a first plurality of the new search concepts as a
first search vector defining a first dimension of a new matrix,
and entering or designating a second plurality of the new
search concepts as a second search vector defining a second
dimension of the new matrix. This embodiment then
searches, based on the new matrix, the documents and other
electronic files represented in the search results within a cell
corresponding to the received cell search command.

These and other objects, features and advantages of the
present invention will become more apparent to, and better
understood by, those skilled in the relevant art from the
following more detailed description of the preferred embodiments
of the invention taken with reference to the accompanying
drawings, in which like features are identified by
like reference numerals.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is an example high level functional flow chart of
a method according to the present invention;

FIG. 2 shows an example graphical user interface display
of a search matrix in accordance with the method of the
present invention;

FIG. 3 shows an example graphical user interface of an
indexing step and concept definition step in accordance with
the method of the present invention;

FIG. 4 shows an example of three-dimensional graphical
search matrix display in accordance with the method of the
present invention;

FIG. 5 shows an example display of a search result speed
reading feature from the graphical user interface of FIG. 2;

FIG. 6 shows an example of the search graphical user
interface according to FIG. 2, using different search queries,
with an overlay of a jump window using a browser feature
of the present invention;

FIG. 7 shows an example graphical user interface for a
multimedia search information collection step in accordance
with the method of the present invention;

FIG. 8 shows an example display of a search result linking
feature in accordance with the method of the present invention;

FIG. 9 shows an example graphical user interface for
searching the same queries as FIG. 8, using a different search
database and [illegible];

FIG. 10 shows an example display of a hyperlink accessed
through the example graphical user interface display of FIG.
[illegible];

FIG. 11 shows an example of a user collecting information
from the example graphical user interface of FIG. 10;
and

FIG. 12 shows an example of a search collection document
into which the information highlighted in FIG. 11, and
related user-generated tags, are inserted.

DETAILED DESCRIPTION OF THE
INVENTION

A method for a structured, but flexible search using an
integrated graphical user interface for entering a user-specified
matrix of both pre-defined and user-defined search
concepts, performing a complete and exhaustive search,
presenting the user with a comprehensive overview of the
search statistics corresponding to the matrix of concepts, and
with information from documents relating to selected search
concept, and quickly linking the user to the search documents
from the integrated graphical user will be described.
Also described are additional features and embodiments,
including analyzing the results of the search in relation to the
search matrix, collecting the search results with assistance of
a display of the search matrix, organizing the collected
search results by receiving tag and relational database data
from the user and merging that data with the collected search
results, and generating a link report identifying and showing
linkages between search concepts reflected by information
in the search results.

The following description includes numerous example
details and specifics which pertain only to the specific
examples presented herein, and are included only to assist in
describing these specific examples, and thus assist the reader
in understanding through example the features and elements
of the present invention. It will be evident to ones skilled in
the art that the invention can be practiced without, and with
different ones of, these details and specifics. Further, this
description assumes an ordinary skill in the art of conventional
Boolean and other commercially available search
engines, known, standard network and database structures
and interface protocols, and conventional programming in,
for example, Basic®, C++, or Java®, running under,
for example, Microsoft Windows 95®, Microsoft Windows
NT®, Linux, Sun Solaris®, or Apple OSX®, or other
equivalent commercially available operating systems.

FIG. 1 shows a high level functional flow chart of a
preferred embodiment of the invention. It will be understood
that FIG. 1 is not a limitation on the invention, and that the
function and novelty of the invention does not depend on all
of depicted functional blocks being present. Instead, the
invention can be practiced and its benefits over the existing
art obtained using only a subset of the depicted number of
functional blocks, as will be better understood upon reading
this description in view of the accompanying drawings.
Further, the particular segmentation and labeling of FIG. 1
is not a limitation on the particular software coding scheme
or software module structure for implementing the invention.
As will be understood by persons of ordinary skill in the
arts relating to this invention, the described invention can be
readily implemented with a segmentation, arrangement and
labeling of functional blocks differing from the example 
shown in FIG. 1. 

Referring to the example flow chart of FIG. 1, the user first
selects, at block 100, one or more global data sources 
"GDS". Global data sources, or GDS, is defined herein as 
the universe of databases which the user wishes to search. In 
the example described here, step 100 is where the user 
selects the database that he or she will search. It is assumed 
that a system setup has made the databases available to the 
user, according to computer and database set-up procedures 
well known to persons of ordinary skill in the art relating to 
this invention. For example, if the World Wide Web, or other 
Internet-related database resource or networked database is 
among those that can be selected by the user, then the user's 
computer (not shown) will typically have an Internet web 
browser, such as Microsoft Internet Explorer®, and the 
user's computer will typically be connected via modem to an 
Internet Service Provider, such as America On Line. 

For purposes of this description, "database" means any 
stored collection or aggregation of digital information, 
which may be arranged in and according to any of the 
various formats known in the art including, for example, 
records, tables, word processing documents, text documents, 
sound files, and image files. These digital information for- 
mats are collectively referenced herein as "electronic files". 
The databases selected at step 100 may be in accordance 
with any of the structures known in the art. It is assumed that 
the person of ordinary skill in the arts relating to this 
invention has a working knowledge of the available database 
structures and, therefore, a detailed description of database 
theory is not necessary for an understanding of this inven- 
tion. 

For purposes of example, the World Wide Web is one of 
the databases GDS which may be selected at step 100. As 
known in the art, the "database" embodied in the World 
Wide Web comprises more than one billion electronic files, 
which are "posted" by storing them within one or more web 
servers (not shown) that are connected to and accessible 
through the Internet. Each posted file is addressed, and 
accessed, by its Uniform Resource Locator (URL) value. 
Electronic files posted in this manner are generally refer- 
enced as "web documents". Business entities such as 
Google® and Yahoo® maintain an index of some, but not 
all, of the publicly available web documents. As will be 
described in more detail with respect to the indexing step 
102, some of the web documents' index entries include key 
words and other file description data characterizing the web
document.
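
As an illustration of index entries that associate key words with web documents addressed by URL, the following Python sketch builds a small inverted index and looks up documents by key word. It is a simplified model under stated assumptions, not a description of how any commercial index is built: the function names `build_index` and `lookup` and the `example.com` URLs are hypothetical.

```python
from collections import defaultdict


def build_index(entries):
    """Build an inverted index from index entries.

    entries: dict mapping a document's URL to its list of key words.
    Returns a dict mapping each lower-cased key word to the set of URLs
    whose entries contain it.
    """
    index = defaultdict(set)
    for url, key_words in entries.items():
        for word in key_words:
            index[word.lower()].add(url)
    return index


def lookup(index, *key_words):
    """Return the URLs whose index entries contain all of the given key words."""
    sets = [index.get(w.lower(), set()) for w in key_words]
    return set.intersection(*sets) if sets else set()
```

Because the lookup touches only the index and never the documents themselves, a query can be answered without retrieving or scanning the posted files.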

For purposes of example, another type of database known 
in the art which may be selected at step 100 for searching 
stores electronic files in a name table having, for each file, 
addressable locations within a storage media identifying 
where the file is stored. Example storage media includes a 
magnetic disc storage unit, an optical disc storage unit, a 
solid state storage unit, or networks thereof. Commercially 
available products, methods and protocols of this type of 
database are well known in the art. 
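
A name table of this kind can be pictured with the following minimal Python sketch, in which each file name maps to an addressable location on a storage unit. The `Location` fields (a device label, a byte offset, and a length), the function name `read_file`, and the in-memory byte strings standing in for storage media are all assumptions made for illustration; real products use their own layouts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """Where a file is stored: a storage unit label, a byte offset, and a length."""
    device: str
    offset: int
    length: int


def read_file(name_table, storage, name):
    """Look up a file's location in the name table and return its bytes.

    name_table: dict mapping a file name to its Location.
    storage: dict mapping a device label to that device's contents (bytes).
    """
    loc = name_table[name]
    data = storage[loc.device]
    return data[loc.offset:loc.offset + loc.length]
```

A search over such a database only needs the name table to locate each candidate file, which is then read from whichever storage unit the table points to.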

Still another type of database which may be searched in 
accordance with this invention, and thus selected at step 100, 
is the "relational database". The term "relational database" 
is defined herein in accordance with its established meaning 
in the relevant art. An example of a relational database is 
Microsoft Access®. 

The GDS databases selected at step 100 may include 
publicly accessible web documents which, although posted, 
do not have their URLs listed by well-known entities such
as Yahoo® and Google®. Further, the selected GDS databases
may be databases for which the documents themselves
are not "posted" but, instead, have a posted web document
which is an index of that database. As known in the art,
search engines may access such an index, either directly or
indirectly through the indexes posted by entities such as
Yahoo® and Google®, and obtain "hits" referencing the
user to the home page of the database. Typically, as known
in the art, the user then accesses the specific documents
using another search engine featured on that home page.

Still other databases selected at step 100 may include,
depending on particular system access privileges that the
user possesses, controlled access databases owned or
controlled, for example, by business entities such as banks,
insurance companies, and medical institutions, or by government
entities. The specific structures and protocols of
such controlled access databases, as well as the hardware
and software resources for maintaining them, are known to
one of ordinary skill in the arts relating to this invention, can
be readily interfaced with the method of the present
invention, and therefore description is omitted.

The particular user interface by which the user selects
databases at step 100 is a design choice readily implemented
by one of ordinary skill in the arts pertaining to this
invention. An example is shown in FIG. 2, which is an
example graphical user interface for performing the search
step 104 described below. Referring to FIG. 2, the depicted
example graphical user interface 10 has a data entry field 12
into which the user enters the name of the GDS databases
selected at step 100. As discussed in reference to the
indexing step 102 below, the user can enter an index name
into field 12 instead of a database name. In fact, the example
search shown in FIG. 3 searches an index, which is named
"Medical".

Referring again to FIG. 2, a further option in the graphical
user interface 10 is a field 14 for the user to specify a higher
level domain in which the GDS database is found. The FIG.
2 example graphical user interface 10 shows the field 14
having a domain labeled "local". Referring to the FIG. 2
example, the "local" domain identifier in field 14 informs the
user's database access computer (not shown) that the "Medical"
database index is local to the computer. As known to
ones skilled in the relevant arts, the meaning of "local" is
determined by the particular computer and operating system
on which the invention is being practiced. For example, as
known to ones skilled in the relevant arts, within a computer
system running under the Windows 95® or NT® operating
system, "local" can be defined as one or more hard drives
(not shown) or other storage media connected directly to the
computer, or shared by multiple computers connected by a
LAN (not shown) to one or more servers (not shown).

It will be understood that a plurality of GDS names or
identifiers may be entered into, for example, field 12 of the
graphical user interface 10 of FIG. 2.

It will also be understood that step 100 may be omitted,
and the invention practiced with, or with the user being
limited to, default databases. Further, it will be understood
that step 100 may be an automatic selection process, using
a computer program readily composed by one of ordinary
skill in the art for selecting one or more GDS databases in
accordance with the subject matter searched.

After the user selects the GDS databases for searching at
step 100, or as an optional first step if the selection step 100
is omitted, the user may employ step 102 to pre-index the
databases to be searched. The pre-indexing step 102 is not
required for practicing the method of this invention.
However, as known in the relevant arts, query searching for
files meeting a criterion is typically much faster if it is
performed, at least in part, on an index of the files instead of
the actual content. The general reasons are well known, and
include the fact that index files are typically much shorter
than the full text files that they characterize, and that index
files are typically formatted so that specific information is
in specific fields. However, as is also known in the art,
there are typical shortcomings with index-based searching.
These include the fact that the ultimate coverage and
accuracy of the search depend, in part, on the accuracy of the
index.

The specific types and methods of indexing performed at
step 102 are in accordance with any of the indexing schemes
known in the relevant arts. Step 102 may, for example,
process each file or document in the selected GDS databases
and generate a key word or key feature profile record
corresponding to that file. As known in the art, the World
Wide Web uses this type of indexing. More particularly,
entities such as Google® and Yahoo® maintain indexes
based on the web documents being in the Hypertext Markup
Language (HTML) format, which has metatags within the
document itself describing its features. The indexes
maintained by entities such as Google® and Yahoo® typically
correspond to, or contain, these HTML metatags of the title,
summary description, and key words. Further, the file
description data may be provided by the person or business
entity that owns the web document. In addition, services
such as Google® and Yahoo® may send, typically via the
Internet, a file extraction program, frequently referenced in
the art as a "spider", to extract additional data for creating
the index.

Notwithstanding the indexes maintained by business
entities such as Yahoo® and Google®, step 102 may index the
World Wide Web using the above methods. Methods for
generating indexes of the World Wide Web are well known
in the art and are therefore not described here. As known in
the art, such indexing is a substantial task requiring
considerable resources, although there may be reasons for step
102 to generate such an index. Consideration of these must, as
known to persons skilled in the art, be in view of the
particular objectives and resources of the user. For example,
it is well known that many commercial World Wide Web
indexes are incomplete and have inconsistent accuracy. In
addition, such indexes are generally formatted for
interfacing with the entities' own search query engines.
Therefore step 102 may include generation of a customized
World Wide Web index, additional to those available
through Yahoo®, Google® or even Altavista®.

Typically, however, the GDS databases that the user
would select for indexing at step 102 would be ones other
than the World Wide Web. There are numerous indexing
schemes and methods, and numerous products for performing
the same, which are well known in the arts relating to this
invention. The specific indexing schemes and methods
employed at step 102 are, as is known to persons skilled
in the art of database management, a design choice and
would be selected and/or written based on factors including
the storage capacity of the processing resources hosting the
database(s) and the search performance desired by the user.
One example that is widely available is the commercial
database management and indexing program sold as
Veritas®.

Referring to FIG. 3, in a preferred embodiment of the
invention the indexing step 102 is carried out by generating
a graphical user interface display such as the depicted
window labeled INDEX having a field area 300. The
INDEX field area 300 is for designating the indexing
operations including, for example, Create New Index field
302, Add Files to Index 304, Re-index 306, and Compress
Index 308. The FIG. 3 INDEX graphical user interface
further includes a field 310 for the user to enter a name for
the index, a field 312 for the user to enter the name of the
directories to be indexed, a field 314 for the user to enter the
names of the folders within the directory(ies) entered in the
field 312 to be indexed, and a field 316 for the user to enter
the names of folders to be excluded from the indexing
operation.

The FIG. 3 graphical user interface 300 is for purposes of
example only, as the design of graphical user interfaces for
selecting the files or databases to be indexed is a design
choice, readily constructed by one of ordinary skill in the art
of computer programming. One typical graphical user
interface for indexing that could be incorporated and used at
step 102 is that shown by the Microsoft Explorer® feature
of the Windows 95® operating system.

As known to persons skilled in the arts relating to this
invention, database indexes such as those generated at step
102 are typically not updated each time that the indexed
databases are searched. Depending on the size of the
database, and the level of the index, such updating may
require substantial time, and may render the database
inaccessible or lessen the access performance while the process
is ongoing. The database index is therefore typically
updated periodically, or in response to a specific command
entered by the user. The criteria for determining the
frequency of updating the index are known to persons skilled
in the arts relating to this invention and, therefore,
description is omitted.

Referring to FIG. 2, which is an example graphical user
interface for performing the search step 104 described
below, the indexing step 102 stores the index in a file (not
shown), and the user can scroll through this as shown in the
FIG. 2 field 12. The specific example index shown in field
12 of FIG. 2 is named "Medical".

As stated above, the indexing step 102 is not required for
practicing this invention. The search step 104 can search
indexes created by others, such as the Yahoo®, Google®
and Altavista® indexes. Search step 104 can also search
non-indexed GDS databases. Search of non-indexed databases is
performed in a manner comparable, in part, to the methods
used by existing query-based search engines to search
non-indexed files. More specifically, as known in the art,
such searches typically examine database files one at a time,
scanning for the query search terms. As known in the
art, such searches typically require considerably greater time
than indexed searches.

Referring to the example flow chart of FIG. 1, after
selecting a GDS database at step 100 and indexing one or
more of the selected databases at step 102, or as a first step
if these steps are omitted or established as a default, the
method of this invention performs the search step 104. As
will be described, search step 104 comprises receiving a
plurality of user-entered search query terms, which are
labeled for reference as CONCEPTS, and assigning or
arranging the CONCEPTS into two sub-pluralities, which
will be referenced as ROW CONCEPTS and COLUMN
CONCEPTS. As will be further described, the CONCEPTS
search query terms may be pre-defined, such as words of the
English language, or may be defined by the user as will be
described in greater detail below.

Referring to the example graphical user interface shown
in FIG. 2, step 104 arranges or assigns a first plurality of the
received CONCEPTS into a first set of M CONCEPTS,
which are labeled herein for reference as ROW CONCEPTS(i),
the index "i" ranging from i=1 to M, with an example set
shown in field 16 of the figure. The step further arranges or
assigns a second plurality of the received CONCEPTS into
a second set of N CONCEPTS, which are labeled herein for
reference as COLUMN CONCEPTS(j), the index "j"
ranging from j=1 to N, with examples shown in field 18 of the
figure. The arrangement or assignment of CONCEPTS as
ROW CONCEPTS and COLUMN CONCEPTS is by user
choice.

It will be understood that the above reference labels of 
COLUMN CONCEPTS and ROW CONCEPTS, their 
respective indices "i" and "j", and the population labels of 
"M" and "N" are merely for consistency of reference in 
describing the method of this invention. These labels and 
indices are not a limitation on the present method, as persons 
of ordinary skill in the computer arts relating to this inven- 
tion can readily identify, upon reading the instant 
description, many alternative label and indexing schemes for 
practicing the inventive method described herein. 

Referring to the example graphical user interface shown
in FIG. 2, after the user has entered the desired M+N
CONCEPTS, in two sets arranged, for example, as the ROW
CONCEPTS(i), i=1 to M, and COLUMN CONCEPTS(j), j=1
to N, the user clicks on field 20, which is labeled "Search".
In response, step 104 performs M query searches, one for
each ROW CONCEPT(i), of the GDS database or database
index identified in field 12 and, for each search, identifies all
data files having a predetermined type of occurrence of a
word or phrase that is within the definition of that ROW
CONCEPT. For purposes of this example "predetermined
type of occurrence" means an occurrence anywhere in the
document. Step 104 also performs N query searches, one for
each COLUMN CONCEPT(j), and identifies all such data
files having an occurrence of a word or phrase that is within
the definition of that COLUMN CONCEPT.

For each ROW CONCEPT(i) and COLUMN CONCEPT(j)
searched, step 104 generates a HITS record, labeled for
reference herein as HITS(ROW CONCEPT(i)) and
HITS(COLUMN CONCEPT(j)). The HITS records include a K
number, which equals the number of documents in the
record. The K number is referenced herein as K(ROW
CONCEPT(i)) and K(COLUMN CONCEPT(j)). The labels
of ROW CONCEPT and COLUMN CONCEPT are for
purposes of reference only, as each is a CONCEPT as
defined herein. As will be understood, the assignment of
ROW CONCEPTS and COLUMN CONCEPTS is a label
for describing the operation of separating the user-input
CONCEPTS into two groups, searching each member of each
group individually, and then forming all possible pairs of a
ROW CONCEPT with a COLUMN CONCEPT, the pair
being referenced as a PAIR CONCEPT(i, j), i=1 to M, j=1 to
N, and searching each pair.
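
As a minimal sketch of the single-CONCEPT searches described above, the following Python fragment builds one HITS record and one K number for each ROW CONCEPT and COLUMN CONCEPT. The in-memory DOCS dictionary, the search helper, and the substring test standing in for "an occurrence anywhere in the document" are illustrative assumptions, not part of the described method.

```python
# Hypothetical stand-in for a GDS database: document id -> full text.
DOCS = {
    "d1": "heart disease treatment options",
    "d2": "hypertension and heart failure",
    "d3": "treatment of hypertension",
    "d4": "eyes and vision",
}


def search(concept, docs):
    """Return the ids of documents in which the concept occurs anywhere."""
    term = concept.lower()
    return sorted(doc_id for doc_id, text in docs.items() if term in text.lower())


row_concepts = ["heart", "eyes"]                 # ROW CONCEPT(i), i = 1..M
column_concepts = ["treatment", "hypertension"]  # COLUMN CONCEPT(j), j = 1..N

# One HITS record per CONCEPT, plus its K number (the document count).
row_hits = {c: search(c, DOCS) for c in row_concepts}
column_hits = {c: search(c, DOCS) for c in column_concepts}
row_k = {c: len(hits) for c, hits in row_hits.items()}
column_k = {c: len(hits) for c, hits in column_hits.items()}
```

A real embodiment would replace search with a query against the selected database or index; only the shape of the result (a list of matching documents and its length) is meant to carry over.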

Referring to the example graphical user interface shown
in FIG. 2, the hit count K(ROW CONCEPT(i)) for each of
the ROW CONCEPTS is displayed in a ROW CELL(i)
located, for example, within an ascending ordered vertical
column in the field labeled 22, which is located, in the
example, to the left of the column of ROW CONCEPTS in
field 16. Likewise, the hit count K(COLUMN CONCEPT(j))
for each of the COLUMN CONCEPTS is displayed in a
horizontal COLUMN CELL(j), which is located, in the
example, in field 24 above the row of COLUMN CONCEPTS
in field 18.

In a preferred embodiment, step 104 downloads each
matching document into a storage (not shown) that is local
to the general purpose computer with database access
capability (not shown) on which the user is interfacing with this
method through, for example, the graphical user interface of
FIG. 2. The search step 104 thus generates records
HITS(ROW CONCEPT(i)) and HITS(COLUMN CONCEPT(j))
having information uniquely identifying each document
found in the GDS database with information matching the
referenced CONCEPT, and automatically downloads the
identified documents into the HITS records. For example, if
the selected GDS database is the World Wide Web and step
104 identifies, for a particular ROW CONCEPT(i), twenty
web pages having matching information, the step 104 forms
the record HITS(ROW CONCEPT(i)) as a list of twenty
URLs, and other information as described below, and a
downloaded copy of each of the twenty web pages.
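
One possible shape for such a HITS record is sketched below in Python. The HitsRecord class, its field names, and the fetch callback are hypothetical; the text specifies only that the record identifies each matching document and holds a downloaded copy of it.

```python
from dataclasses import dataclass, field


@dataclass
class HitsRecord:
    """Hypothetical HITS record: identifiers plus downloaded copies."""
    concept: str
    urls: list = field(default_factory=list)    # unique document identifiers
    copies: dict = field(default_factory=dict)  # url -> downloaded content

    @property
    def k(self):
        """The K number: how many documents the record lists."""
        return len(self.urls)


def build_hits_record(concept, matching_urls, fetch):
    """Store each matching URL together with a local copy of its content."""
    record = HitsRecord(concept)
    for url in matching_urls:
        record.urls.append(url)
        record.copies[url] = fetch(url)
    return record


# Usage with a fake fetcher standing in for a real download of twenty pages.
pages = {f"http://example.com/page{n}": f"page {n} about heart" for n in range(1, 21)}
record = build_hits_record("heart", list(pages), pages.__getitem__)
```

Passing the fetcher in as a parameter keeps the sketch independent of any particular network library.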

Referring to the specific example in FIG. 2, ROW
CONCEPT(1) is "heart", and ROW CELL(1) displays "15",
meaning that K("heart") is "15" and, therefore, fifteen
documents or other electronic files within the database being
searched, which in this example is "Medical", contain a
word or term within the definition of the CONCEPT of
"heart". ROW CONCEPT(7) of the FIG. 2 example is
"eyes" and ROW CELL(7) in field 22 displays "1", meaning
that K("eyes") is "1" and, therefore, the GDS database had
only one document or file meeting the definition of the
CONCEPT "eyes". Similarly, COLUMN CONCEPT(9) of
the FIG. 2 example is "hypertension" and COLUMN CELL(9)
in field 24 displays "8", meaning that K("hypertension")
is "8" and, therefore, the GDS database had eight documents
or files meeting the definition of the CONCEPT
"hypertension".

Next, step 104 performs a systematic search of the GDS
database using every possible pair of a ROW CONCEPT
and a COLUMN CONCEPT. The specific order of the
automatic pair formation and searching is a design choice.
For purposes of example, the pair formation process begins
by selecting the first ROW CONCEPT, which is ROW
CONCEPT(1), logically pairing it, sequentially, with each
individual COLUMN CONCEPT(j), j=1 to N, and, for each
pair, performing a "search" of the selected GDS databases.
Each pair is referenced herein as PAIR CONCEPT(1, j), j=1
to N. The process then selects ROW CONCEPT(2) and
logically pairs it with each of the COLUMN CONCEPTS,
to form N new PAIR CONCEPTS(2, j), j=1 to N. The step
also searches the GDS database using each of the PAIR
CONCEPTS. The process repeats until it has selected the
Mth ROW CONCEPT, paired it with each of the N
COLUMN CONCEPTS, and searched the GDS database using
each. The logical operation of the pairing may be selectable
but, preferably, it is the Boolean AND function. Therefore,
assuming that the AND function is used, PAIR CONCEPT(1, 1)
is ROW CONCEPT(1) AND COLUMN CONCEPT(1).
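
The row-by-row pairing just described can be sketched as a pair of nested loops. This is a minimal Python illustration assuming the Boolean AND pairing, a small in-memory corpus, and substring matching; the names DOCS, occurs, pair_search and pair_matrix are hypothetical.

```python
# Hypothetical stand-in for a GDS database: document id -> full text.
DOCS = {
    "d1": "heart disease treatment options",
    "d2": "hypertension and heart failure",
    "d3": "treatment of hypertension",
    "d4": "eyes and vision",
}


def occurs(concept, text):
    """True if the concept occurs anywhere in the text."""
    return concept.lower() in text.lower()


def pair_search(row_concept, column_concept, docs):
    """HITS(PAIR CONCEPT): documents matching ROW CONCEPT AND COLUMN CONCEPT."""
    return sorted(
        doc_id
        for doc_id, text in docs.items()
        if occurs(row_concept, text) and occurs(column_concept, text)
    )


def pair_matrix(row_concepts, column_concepts, docs):
    """K(PAIR CONCEPT(i, j)) for every i, j, taken one row at a time."""
    return [
        [len(pair_search(r, c, docs)) for c in column_concepts]
        for r in row_concepts
    ]


matrix = pair_matrix(["heart", "eyes"], ["treatment", "hypertension"], DOCS)
```

Each inner list corresponds to one ROW CONCEPT and holds the N pair counts for that row, which matches the layout of the PAIR CELLs.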

Referring to the specific FIG. 2 example graphical user
interface, the step 104 search of the GDS database using
each of the PAIR CONCEPTS(i, j), i=1 to M, j=1 to N,
generates (MxN) records, labeled for reference as
HITS(PAIR CONCEPT(i, j)), each identical in structure to the
above-described HITS(ROW CONCEPT(i)) and
HITS(COLUMN CONCEPT(j)) records. The step displays the hit
count for each, referenced herein as K(PAIR CONCEPT(i,
j)), in a corresponding PAIR CELL(i, j) in field 28 of FIG.
2.
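
The HITS(PAIR CONCEPT(i, j)) records could in principle be produced either by querying the database with the pair or by intersecting the two single-CONCEPT HITS records. The Python sketch below, built on a synthetic corpus and a hypothetical per-record limit, illustrates how the two approaches can diverge when each HITS record is capped. All names and numbers in it are invented for illustration.

```python
def capped_hits(term, docs, limit=None):
    """HITS record holding at most `limit` matching ids (None = no cap)."""
    matching = [doc_id for doc_id, text in docs.items() if term in text]
    return set(matching[:limit])


def pair_by_intersection(a, b, docs, limit=None):
    """Derive the pair result from two (possibly capped) HITS records."""
    return capped_hits(a, docs, limit) & capped_hits(b, docs, limit)


def pair_by_query(a, b, docs):
    """Query the corpus directly for documents containing both terms."""
    return {doc_id for doc_id, text in docs.items() if a in text and b in text}


# Synthetic corpus: 250 documents per single term, 50 containing both.
docs = {}
for n in range(250):
    docs[f"auto{n}"] = "automobile review"
for n in range(250):
    docs[f"span{n}"] = "spanish lesson"
for n in range(50):
    docs[f"both{n}"] = "spanish automobile market"

direct = pair_by_query("automobile", "spanish", docs)
from_capped = pair_by_intersection("automobile", "spanish", docs, limit=100)
from_full = pair_by_intersection("automobile", "spanish", docs)
```

In this contrived corpus the capped intersection finds none of the fifty documents that the direct query returns, while the uncapped intersection agrees with the direct query.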

The search query operation performed for each PAIR
CONCEPT(i, j) is identified as a "search", but it is
contemplated that, in some applications, it may be unnecessary to
have an additional interface with, or query of, the selected
GDS databases. Instead, the "search" for each PAIR
CONCEPT(i, j) could compare HITS(ROW CONCEPT(i))
with HITS(COLUMN CONCEPT(j)) and identify all
documents or files appearing in both HITS records. In a preferred
embodiment, though, step 104 performs an actual
query-based search on the GDS database using each PAIR
CONCEPT(i, j). Searching each PAIR CONCEPT(i, j) may
be preferable if, for example, the user had entered a limit
(not shown) on the number of HITS for the ROW
CONCEPTS and/or COLUMN CONCEPTS. For example, if a
user or particular embodiment placed a limit of one
hundred documents in each HITS record, and the GDS database
were the World Wide Web, then the number of documents
listed in the HITS records might be a small percentage of the
matching documents. If, for example, ROW CONCEPT(1) is
"Automobile", and COLUMN CONCEPT(1) is "Spanish",
then the number of hits, i.e., K(automobile) and K(Spanish),
is likely to be in the tens of thousands. The documents in the
records HITS(Spanish) and HITS(automobile) would have
only one hundred of these. A comparison of the respective
records of HITS(Spanish) with HITS(automobile) would,
therefore, have a significant probability of missing many
documents which, although having both "automobile" and
"Spanish", were omitted from one or both of the one
hundred documents listed in each of the HITS records. If,
however, step 104 performed an actual query-based search
of the World Wide Web using the paired CONCEPT
(automobile, Spanish), it is likely that the search would
return a usable quantity of matching documents.

The above-described search step 104 may include a filter
that is additional to the criteria set by the CONCEPTS. The
filter may qualify documents based on features including,
but not limited to, date, source, author, format and size. For
purposes of this example the type of occurrence is anywhere
in the document.

The specific code-level process of searching the database
in field 12 for occurrences of each ROW CONCEPT(i),
COLUMN CONCEPT(j) and PAIR CONCEPT(i, j) is a
design choice, readily made by one of ordinary skill in the
art. As known in the art, the specific code-level process
depends, in part, on the structure of the database searched,
whether or not the search is index-based, and on the
interface requirements particular to the commercial database
management system (DBMS) on which the database in field
12 is structured.

Referring to the FIG. 2 example graphical user interface,
PAIR CONCEPT(1, 1) as identified above is the CONCEPT
"heart" ANDed with the CONCEPT "treatment". This is
referenced herein as ("heart", "treatment"). As
shown in FIG. 2, PAIR CELL(1, 1) displays "14", meaning
that fourteen documents or files within the GDS database
had information meeting the definition of "heart" and
meeting the definition of "treatment". Similarly, PAIR
CONCEPT(4, 6) is ([...], "stroke"). It is seen from the "0"
displayed in PAIR CELL(4, 6) that K([...], "stroke") is
"0", meaning that the "Medical" index of this example did
not contain any documents or files having information
meeting the CONCEPT [...] AND the CONCEPT "stroke".

As described below, the scope of information that is
within the meaning of a CONCEPT is determined by the
user-entered, or previously stored, Boolean phrase defining
the CONCEPT, and by EXPANDORS added by the user to
the typed letter strings appearing in fields 16 and 18.

As described and shown by the FIG. 2 example graphical
user interface, the method of this invention provides the
user, by clicking on the "Search" button 20 one time, with
the results of M plus N, plus (M times N) searches. The user
then sees in, for example, the graphical user interface of
FIG. 2, the comparative frequency of occurrence of all M of
the ROW CONCEPTS, all N of the COLUMN CONCEPTS,
and the (MxN) pairings formed by PAIR CONCEPTS(i, j),
for i=1 to M, j=1 to N. This is substantially more efficient,
and more exhaustive in its search coverage, than the prior art
method of picking a search query term, or formulating a
Boolean expression, performing a search, observing
the results, and picking [...] of the terms [...].

In the specific example shown in FIG. 2,
the number of ROW CONCEPTS is seven and the number
of COLUMN CONCEPTS is twelve. Step 104 then
generates a HITS record for each of the seven ROW
CONCEPTS, a HITS record for each of the twelve COLUMN
CONCEPTS, and (seven times twelve), or eighty-four, HITS
records for the PAIR CONCEPTS(i, j), i=1 to 7 and
j=1 to 12. The total number of searched query terms and
expressions of the same, and the number of HITS records
showing the search result, is therefore, in the specific
example shown in FIG. 2, one hundred three. To have this
coverage with the method of the prior art the user would
have to manually perform one hundred three searches. The
user would then have to write down, or otherwise store, the
results of each search. Even if the user managed to perform
such a task, which would be a formidable job to complete,
he or she would have expended considerable time that could
have been otherwise used more effectively, and would not
have the benefit of a single matrix formatted display of all
the search results as is presented in FIG. 2.

The above-identified step 104 searches for each ROW
CONCEPT, each COLUMN CONCEPT and each PAIR
CONCEPT(i, j) may be performed sequentially, or in
parallel (i.e., concurrent), or as a combination of sequential and
parallel. As known [...], the selection between sequential
and parallel searching depends on the [...] of the database
and the specific [...]. The current
commercialized reductions to practice of the present invention
use a sequential search, in which COLUMN CONCEPT(1) is
searched, a record of HITS(COLUMN CONCEPT(1)) is
generated, and then COLUMN CONCEPT(2) is searched and
so on until all the COLUMN CONCEPTS have been
searched. Related to the sequential search is a STOP
SEARCH user interface button (not shown) appearing on the
graphical user interface such as the example of FIG. 2. As
described above, in a preferred embodiment of the step 104
sequential search the HITS record is generated and the
numbers K(CONCEPT) are displayed in sequence as the
search progresses through the CONCEPTS. It is
contemplated that particular choices of CONCEPTS and/or
user-entered definitions of CONCEPTS, the definition process
being described below, will yield HITS records necessitating
a change of CONCEPTS. The STOP SEARCH button
enables the user to stop the step 104 search accordingly.

Referring to FIGS. 2 and 3, the definition of the
CONCEPTS will be described. FIG. 2, as described above,
is an example graphical user interface 10 for performing the
search step 104. FIG. 3 is an example graphical user
interface 320 for defining the CONCEPTS. Field 320
includes a CONCEPT LIST 322 which lists all CONCEPTS
that one or more users have defined. The list of CONCEPTS
can be global, or can be particular to a user, the latter being
written according to standard software design practice for
multi-user programs having one or more parameter lists
specific to particular users.
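
A stored CONCEPT list with global and per-user entries, together with the "@" reference and the "street OR avenue OR boulevard" definition discussed in connection with FIG. 3, might be sketched in Python as follows. The dictionaries, the user name "alice", her "highway OR freeway" entry, and the rule that a user's entry overrides the global one are all assumptions made for illustration. Only OR-connected definitions are handled.

```python
import re

# Hypothetical stored CONCEPT lists: one global, one per user.
GLOBAL_CONCEPTS = {"ROAD": "street OR avenue OR boulevard"}
USER_CONCEPTS = {"alice": {"ROAD": "highway OR freeway"}}


def concepts_for(user):
    """Merge the global list with the user's own list (user entries win)."""
    return {**GLOBAL_CONCEPTS, **USER_CONCEPTS.get(user, {})}


def expand(term, concepts):
    """Return the literal strings to search for when `term` is typed."""
    if term.startswith("@"):
        # "@NAME" invokes the stored definition of NAME.
        definition = concepts[term[1:]]
        return [part.strip() for part in re.split(r"\s+OR\s+", definition)]
    # A bare term is searched as the typed letter string itself.
    return [term]


def occurs(term, text, concepts):
    """True if any expansion of the term occurs anywhere in the text."""
    return any(s.lower() in text.lower() for s in expand(term, concepts))
```

With these definitions, typing "@ROAD" matches a document mentioning "Fifth Avenue", whereas typing "ROAD" matches only the literal four-letter string.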

Field 322 of FIG. 3 displays the following CONCEPTS:
"AIDS", "CHEMWPNS", "DRUGS", "HEARATK",
"MDR", "MEXICAN", "OC", "TERRORISM",
"USEAST", "USGSI", "USSOUTH", "USWEST",
"VIOLENCE" and "WEATHER". The FIG. 3 example provides
the user means to highlight one of the CONCEPTS in the list
322, such as the "WEATHER" that is highlighted at 322A in
the figure. The code for displaying a list of entries in a file,
scrolling through the displayed entries, and highlighting one
or more of the list using, for example, the left click of a
conventional computer mouse is well known in the relevant
arts and, therefore, description is omitted. Field 324 displays
the selected CONCEPT and field 326, labeled
EXPANSION, shows the Boolean expression that defines it.

Field 328 of FIG. 3 presents the user with a list of
CONNECTIONS and EXPANDORS, which are Boolean
operators and search query expanders for connecting
and modifying terms within the expression appearing in
field 324. Example CONNECTIONS in the FIG. 3 field 328
are: "AND", "OR" and "WITHIN#". Example EXPANDORS
are: "Concept@", "Fuzzy", "NOT", "PHONIC",
"STEMMING", and "SYNONYMS". Each of the example
CONNECTIONS is a Boolean logic term that is known in
the art relating to this invention. The example EXPANDORS
of "NOT", "PHONIC", "STEMMING" and "SYNONYMS"
are likewise well known, and further description
is therefore omitted. The example EXPANDOR labeled
"Concept@" denotes the CONCEPT following it as, itself,
another defined CONCEPT. For example, a CONCEPT
typed into field 324 as "ROAD" would be defined as the four
letter string "R, O, A, D". The search step 104 would therefore
look for this four letter string when searching the GDS
databases. On the other hand, if the user had defined the
CONCEPT "ROAD" using the FIG. 3 example graphical
user interface, and then stored it in the list of CONCEPTS
in field 322, then subsequent use of that definition could be
invoked by typing it as "@ROAD". An example definition
of "ROAD" would be "street OR avenue OR boulevard".
The example EXPANDOR labeled "Fuzzy" is one where
"fuzzy" is defined according to its understood meaning in
the arts pertaining to this invention.

As can be understood by persons skilled in the arts
relating to this invention, the use, power and computational
overhead relating to particular CONNECTIONS and
EXPANDORS depend, in part, on the particular structure of
the GDS databases selected at step 100, in addition to the
indexing scheme, if any.

Referring to the FIG. 3 example, the selected CONCEPT
is "WEATHER", which is defined in field 326 as "storm or
rain or winds or tornado or ~hurricane or snow or ~flood or
storm surf or weather-related or NOAA or thunderstorms or
El Nino". The tilde sign "~" is a wild card. The Boolean OR
operator in field 328 connects the terms. The
expression in field 326 defines the CONCEPT "WEATHER"
as any word or information that meets any term in the
expression.

The CONCEPT definition field 320 of FIG. 3 allows a
user to define existing words in a manner suited for the
particular search, and to coin new CONCEPTS. For
example, if a user wished to define "ski" to focus on water
skiing and not include snow skiing, it could be entered in the
list 322 and defined in field 326 as SKI AND WATER AND
BOAT. Alternatively, the user could coin a word, such as
SNSKI, to have the same meaning.

FIG. 2 shows the search results of step 104 as a
two-dimensional array of numbers, as ROW CELLS in field 22,
COLUMN CELLS in field 24, and PAIR CELLS in field 28,
reflecting the HITS record obtained for each of the ROW
CONCEPTS, COLUMN CONCEPTS, and PAIR
CONCEPTS(i, j). Referring to FIG. 4, an optional feature of the
invention displays the K numbers for each HITS record as
a three dimensional graph 400, with the height of each cell
graph, a representative one being labeled as 402,
representing the search results of its associated COLUMN
CONCEPT, ROW CONCEPT or PAIR CONCEPT(i, j). The
FIG. 4 example sets the height of each cell graph 402
according to the number of documents or files meeting the
criteria of the cell's corresponding CONCEPTS. The FIG. 4
display also includes a legend 404 which shows color codes
of CONCEPTS embodied by the COLUMN
CONCEPTS and ROW CONCEPTS. The cell graphs 402 of each
PAIR CONCEPT(i, j) [...] its constituent ith ROW CONCEPT
and jth COLUMN CONCEPT.

Referring to the high level flow chart of FIG. 1, and the
graphical user interface FIGS. 2 and 5, an analysis step 106
that is closely integrated with the search step 104 will be
described. Step 106 can be used after each iteration of search
step 104 and, based on commands received from the user,
assists in analyzing the search using information provided
by the HITS records generated at step 104. It will be
understood that describing step 106 as the "next step" is for
purposes of describing an example operation of this
invention. As will be understood, the analysis step 106, as well as
the later-described steps of collection and reporting, do not
have to be completed before re-running step 104, either with
different ROW CONCEPTS and COLUMN CONCEPTS, or
after returning to steps 100 and 102 to select different GDS
databases or indexing options. Also, as will
be described, the present method permits the user to perform
a "drill down" search on the documents within the HITS
record of any of the ROW CELLS, COLUMN CELLS, or
PAIR CELLS of FIG. 2.

Referring to FIG. 2, after the step 104 search the user is
presented with a two dimensional array of HITS records,
with the number of documents within the HITS record
appearing in the cell corresponding in position to the
CONCEPTS searched. As described above, each HITS record
contains [...] the search for the CONCEPT or pair of
CONCEPTS corresponding to the cell. The user can highlight
any of the cells using, for example, the left click of a
conventional computer mouse, whereupon step 106 presents,
in field 30 of FIG. 2, the ordered list of documents identified
in the record HITS corresponding to the selected cell. For
example, in FIG. 2, PAIR CELL(1, 1) in field 28 is
highlighted. PAIR CELL(1, 1) corresponds to PAIR
CONCEPT(1, 1), which is (heart, treatment). As shown by the
number "14" appearing in PAIR CELL(1, 1), fourteen
documents were found in the "Medical" database index having
information within the meaning of the "heart" CONCEPT
and the "treatment" CONCEPT.

When the user highlights PAIR CELL(1, 1) the display
field 30 presents, in the FIG. 2 example, the first five
documents in the HITS(heart, treatment) record. The display
field 30 has, for each document listed, a field 30A for the
document order number, a field 30B for the document's file
name, a field 30C for the size in bytes of the document, a
field 30D for the title of the document, if any exists, a field
30E identifying a date associated with the document, and a
field 30F identifying the GDS database or database index
name in which the document was found. As will be
understood by one of ordinary skill, depending on the GDS
database searched, and whether or not the GDS was indexed,
the documents may not have "names" and, in such cases,
there may be no data appearing in field 30B. Similarly, in
some instances, there may be no date data appearing in the
field 30E.

The ordering of the documents listed in field 30 is a design
choice, and may be in accordance with, for example, the
"relevance" ordering used by Yahoo. In the particular
example shown in FIG. 2, the ordering is according to the
number of occurrences, within the documents, of the search
CONCEPTS for which the documents were obtained.
Referring to FIG. 2, the uppermost document, having a label of
"1" [...], has the greatest number of occurrences
of the PAIR CONCEPT (heart, treatment) of any of the
documents listed, namely "30" as appears in field 30B.

Also in step 106 the user can click a MATRIX SPEED
READING button (not shown) on the graphical user
interface of FIG. 2 and will be presented with a new graphical
user interface such as, for example, the MATRIX SPEED
READING display 500 of FIG. 5, allowing a quick reading
of the documents within the HITS record for any cell in field
30 of FIG. 2. As shown, the example MATRIX SPEED
READING display 500 includes a field 502 and a field 504
listing, respectively, all of the COLUMN CONCEPTS and
ROW CONCEPTS used in the search step 104. Through a
standard interface device such as, for example, a mouse (not
shown) the user highlights one of the ROW CONCEPTS in
field [...]

As seen from the example text appearing in field 600 of
FIG. 6, the browser automatically highlights all occurrences
of the search CONCEPTS within the selected document.
The specific code-level design of the browser is readily
generated by one of ordinary skill in the arts relating to this
invention and, accordingly, description is omitted.

Referring to FIGS. 1, 2 and 7, a collection step 108 and
reporting step 110 for systematically extracting
information from user-selected documents within a HITS record,
and placing that information into a convenient, consistent,
user-specified template for ready insertion into a template
report will be described.

As described referring to FIG. 2, the user [...]
the search results by viewing the ROW CELLS, COLUMN
CELLS and PAIR CELLS in fields 22, 24 and 28, clicking
on those of interest, scrolling through the documents and
reading the contents. The user can now generate a NOTES
document, such as the example in FIG. 7, which is a
template onto which the user pastes information from the
particular document which the user has selected and is
reading using, for example, the browser feature described in
reference to FIG. 5. The user generates a NOTES document
such as shown in FIG. 7 by clicking on the "NOTES" screen
button 32 in FIG. 2, which generates a pull-down from
which the user can generate the document.

Referring to FIG. 7, each NOTES document 700 is a
summary profile of a particular document obtained by the
search step 104, the NOTES document 700 being formatted
for the user to enter certain information in specific fields. An

^^ R ^^Tr^J ( ^ J h^J^ h l htS ODC °f the C °V 30 exam P le formatting is shown in FIG. 7, and consists of a 

UMN CONCEPTS in field 504. In the particular example HEADER 702, a DOCUMENT TEXT field 704, a DOCU- 

s }°™ *? HG - 5 the user has highlighted the ROW CON- MENT IMAGE field 706, and a COMMENT field 708. The 

CEPT "heart and the COLUMN CONCEPT "treatment", HEADER field 702 includes TOPIC field 710, a document 

which presents a scroll list in field 30, as described in DATE field 712, a USER field 714, a CATEGORY field 716, 

reference to FIG. 2, having short descriptions of the docu- a label field 718> a SOURCE field 720, and an 

ments in the record HITS("heart", "treatment"). Field 506 ATTACHED field 722. Some of the HEADER fields 

scrolls through a more detailed description of the documents 710-722 are automatically filled in, using the information in 

listed in field 30. ^ mTs record ^ tfac documcnt 

Referring to FIG. 6, another search example using the include the DATE field 712 and the SOURCE field 720. The 

search matrix graphical user interface 10 of FIG. 2, with a ^ user selects and enters information into the TOPIC FIELD 

further result browser feature 600 of the invention, will be 710, the CATEGORY field 716 and the LABEL field 718, 

described. Field 16 of FIG. 6 lists ten particular ROW preferably using an outline hierarchy for the REPORT step 

CONCEPTS including "@HOLOMEM", which is ROW U2 of FIG. 1, described below, to assemble a plurality of 

CONCEPT(l), and "holograph", which is ROW CONCEPT NOTES into a REPORT. 

5, and "HDSS", which is ROW CONCEPT(IO). As 45 Referring to FIGS. 6 and 7, information from a document 

described in reference to field 328 of the CONCEPT defi- listed in FIG. 6 field 30, and selected and displayed through 

nition interface area 320 in FIG. 3, the ampersand "@" me browser shown by FIG. 6 field 600 is moved into the 

prefix to "HOLOME" in ROW CONCEPT(l) means that the NOTES document of FIG. 7 by a "drag and drop" process 
CONCEPT "HOLOME" is in the CONCEPT list of field^ that, itself,_is wel^known in_the relevant arts._ More— 

322-anoVaccordingly, it is defined according to a Boolean 50 particularly, the user would select, typically by highlighting, 

expression appearing in field 326. text portions from the document appearing in the browser 

Referring to FIG. 6, CONCEPT(5) is "holograph, which display 600, and "drag and drop" those portions into the 

is a self-defined search query term of "holograph" including, DOCUMENT TEXT field 714 of the NOTES document If 

as indicated by the tilda "", a tail of any letter string. the document appearing in the browser display 600 is in 

A shown in FIG. 6, the user has highlighted PAIR 55 HTML format, or another format supporting hyperlinks, the 

CELL(1, 2), corresponding, in the example, to PAIR CON- text appearing in field 714 will have those hyperlinks. If 

CEPT ("@HOLOMEM", "©HOLOGRAM"). The PAIR there are any images using, for example, jpg or .tiff format, 

CELL(1, 2) displays "45", meaning that the number of within the document appearing in the browser display 600 

documents in this HITS record is forty-five. As described in the user can insert these into the DOCUMENT IMAGE field 

reference to FIG. 2, the user's highlighting of PAIR CELL(1, 60 718 - or during the "drag and drop" operations the user 

2) causes a list of the documents identified in HITS can type his or her own comments into the COMMENT field 

("@HOLOME", "©HOLOGRAM") to appear as a scroll 708. 

list in field 30. The user can then highlight any of the In a typical search session using the method of the present 

documents in field 30, whereupon a browser (not shown) invention the user may generate a plurality of, for example, 

retrieves the documents and displays it in field 600. Field 65 ten FIG. 7 NOTES documents 700. 

600 may, depending on the design choice for the browser, be As will be understood, the described method covers the 

an overlay window appearing with the field 30 display. entire knowledge management process of searching, 
collecting, analyzing, organizing, and reporting, it provides the ability to conduct text searches across files, documents, Web pages, and databases located anywhere on the user's personal computer, network, or Internet. The present invention is not just a search tool; it includes the described set of analytical and reporting tools enabling the user to evaluate the results of searches, organize and create reports. The described method is for searching a relational database(s) or a collection of unrelated documents and text files.

Search step 104 can be repeated, using the HITS records identified by any particular COLUMN CELL, ROW CELL or PAIR CELL as the universe of documents searched. The documents can be searched, by applying another set of COLUMN CONCEPTS and ROW CONCEPTS to the results of the first search. This can be repeated as often as the user wishes. This is commonly referred to as "drilldown".

The matrix search parameters, i.e., the GDS database, the ROW CONCEPTS and COLUMN CONCEPTS and the HITs records of the results can be saved, using file storage methods well-known to persons of ordinary skill in the arts relating to this invention, for later use and shared with other users.

The matrix search parameters, i.e., the GDS database, the ROW CONCEPTS and COLUMN CONCEPTS, and the HITs records of the results can be exported/copy filed into other directories by selecting the result ROW CELLS, COLUMN CELLS and PAIR CELLS. This enables the user to reorganize his data into other directories and also send collections of data to other users.

The described Collecting step 108 enables the user to store full or partial text (selected by highlighting words) into digital filing system collections database, and to save multimedia type documents (e.g., .jpeg, .gif, .wav). The described DOCUMENTS TEXT field 704, DOCUMENT IMAGE field 706 and COMMENT field 708 of the FIG. 7 NOTES documents 700 enables the user to include other related data into collection fields, store hyperlink back to the original source. In addition, the format of NOTES documents 700 allows them to be collected as a library (not shown) for rapid searching, using the HEADER 700 as an index. Further, the formatting of the NOTES documents allows step 112 to generate web-ready REPORTS from the collection database. In addition, a Create Link Analysis function of step 112 provides charts/reports that are linked back to items in the collection database. These Links reports can be shared with other users.

Referring to display 900 of FIG. 9, a further embodiment of this invention uses the matrix of ROW CONCEPTS and COLUMN CONCEPTS entered, for example, through the FIG. 2 example interface, with a third party search engine. The FIG. 9 example uses the Northern Lights® search engine to search the Internet using the CONCEPTS of the FIG. 6 example. The CONCEPT matrix search process is in accordance with the FIG. 6 example, namely searching M ROW CONCEPTS(i), i=1 to M, searching N COLUMN CONCEPTS(j), j=1 to N, and then all MxN PAIR CONCEPTS(i, j), i=1 to M, j=1 to N. An additional operation is that the CONCEPTS are converted into a form compatible with, for the FIG. 9 example, the Northern Lights search engine. More particularly, the field 901 shows that the specific PAIR CONCEPT searched in the FIG. 9 example is the PAIR CONCEPT ("@HOLOMEM", "@HOLOGRP"). As described, the preferred logical operation for forming a PAIR CONCEPT from a ROW CONCEPT and a COLUMN CONCEPT is the AND operation. The PAIR CONCEPT ("@HOLOMEM", "@HOLOGRP") is therefore, for this example, the CONCEPT "@HOLOMEM" ANDed with the CONCEPT "@HOLOGRP". To carry out the search, the method identifies the leading ampersand of each of the CONCEPTS and, in response, automatically expands each into its Boolean expression, as described in reference to FIG. 3. The expression is then re-formatted for loading into field 902 which, for this example, accommodates the Northern Lights® search engine.

The example expansion is as follows: It is assumed that the CONCEPT of "@HOLOMEM" was previously [...] using, for example, the [...] interface of [...] holographic [...] OR HDSS OR "Holographic Data Storage System" OR "optical storage" OR "spectral recording" OR "spherical memory"). The tilda after "holographic" is, as described [...] which encompasses, for example, "holographical". It will also be assumed that "@HOLOGRP" was previously defined as ("IBM" OR "University of Dayton" OR "Lucent" OR "Bayer Corp" OR "Rockwell" OR "Kodak" OR "Stanford University"). The method therefore performs the search shown in FIG. 9 by ANDing the respective expansions of "@HOLOMEM" and "@HOLOGRP" and formatting the resulting expression as required by the Northern Light® search engine, as shown in field 902. The formatting operation is readily implemented in conventional scripting code by one of ordinary skill in the arts relevant to this invention.

Referring to FIGS. 6 and 9, the entire matrix of ROW CONCEPTS, COLUMN CONCEPTS and PAIR CONCEPTS shown in FIG. 6 is searched by formatting each CONCEPT, or PAIR CONCEPT such as, for example, the individual and PAIR CONCEPTS "@HOLOMEM" and "@HOLOGRP", performing the search for each, and displaying the results in the matrix format of, for example, FIG. 2. The search results are listed in field 904. Each entry (not labeled) in field 904 has a hyperlink (not labeled) such as, for example, "HOLOGRAPHIC DATA STORAGE", which is entry "2", in accordance with the standard format of Northern Lights and similar pay-per-document search services. As known to one of ordinary skill in the arts pertaining to this invention, clicking on any of the hyperlinks downloads either a complete document or, as typical with services such as Northern Lights®, a summary, as shown in FIG. 10.

Referring to FIG. 10, the user typically purchases the document by clicking on, for example, field 1000. The resulting purchase operation is well known in the art.

It will be understood that, typically, the quick-reading feature described in reference to FIG. 5 will not be available when the matrix search of CONCEPTS interfaces to and uses pay-per-view search engines. Similarly, the browser feature of FIG. 6 is not available. Referring to FIGS. 6, 7, 11 and 12, however, the previously described NOTES document generation can be used with typically free summaries provided by the third-party search services. More particularly, as shown in FIGS. 9, 10 and 11, pay-per-document search services such as, for example, Northern Lights typically download a summary of a document that a user clicks on, providing enough information to the user to allow a proper decision to purchase. FIG. 11 shows such a summary, which the service such as Northern Lights downloaded to the user in response to the user clicking on one of the hyperlinks in FIG. 9 field 904, as described above. Fields 1100 and 1102 show the title and content of the summary, respectively, after being highlighted by the user. Referring to FIG. 12, the content of the highlighted fields 1100 and 1102 from FIG. 11 are inserted into fields 1202 and 1203, respectively, of the
generated NOTES document in the manner described in
reference to FIG. 7. Field 1204 of FIG. 12 contains the URL 
identifier of the summarized document. The user or another 
person can then use the NOTES document shown in FIG. 12 
to retrieve and pay for a complete content of the document. 
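
The CONCEPT expansion described in reference to FIG. 9 can be illustrated with a minimal sketch. It is an assumption-laden example rather than the patented implementation: the dictionary `CONCEPTS`, the function names, and the abbreviated definitions are hypothetical, and no attempt is made to reproduce the query syntax of any particular third-party search engine.

```python
# Hypothetical CONCEPT list standing in for fields 322 and 326 of FIG. 3.
# The Boolean expressions are abbreviated from the example in the text.
CONCEPTS = {
    "HOLOMEM": 'HDSS OR "Holographic Data Storage System" OR "optical storage"',
    "HOLOGRP": '"IBM" OR "Lucent" OR "Kodak"',
}


def expand(concept: str) -> str:
    """Expand a CONCEPT carrying the leading "@" into its Boolean expression."""
    if concept.startswith("@"):
        return "(" + CONCEPTS[concept[1:]] + ")"
    # A self-defined term such as holograph~ is passed through unchanged.
    return concept


def pair_query(row: str, column: str) -> str:
    """Form a PAIR CONCEPT by ANDing the expanded ROW and COLUMN CONCEPTS."""
    return f"{expand(row)} AND {expand(column)}"


def matrix_queries(rows, columns):
    """Build one query string per cell of the M x N PAIR CONCEPT matrix."""
    return {(r, c): pair_query(r, c) for r in rows for c in columns}


if __name__ == "__main__":
    for cell, query in matrix_queries(["@HOLOMEM", "holograph~"], ["@HOLOGRP"]).items():
        print(cell, "->", query)
```

Each generated string would still need to be re-formatted for the target search engine before being loaded into a field such as field 902; that step is engine-specific and is left out here.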

While the present invention has been disclosed with 
reference to certain preferred embodiments, these should not 
be considered to limit the present invention. One skilled in 
the art will readily recognize that variations of these embodi- 
ments are possible, each falling within the scope of the 
invention, as set forth in the claims below. 

What is claimed is: 

1. A method for searching electronic text files, comprising 
steps of: 

providing a plurality of electronic text files; 

providing a human interface apparatus for a user to access 

the plurality of electronic text files; 
receiving, at the human interface apparatus, a plurality of 

M first search terms; 
receiving, at the human interface apparatus, a plurality of 

N second search terms; 
receiving a start search command data and, in response, 

performing the following steps: 

(i) performing a key search of the plurality of electronic 
files and generating a first search term hit list asso-
ciated with each of the M first search terms, the first
search term hit list identifying, for each of the M first 
search terms, the electronic text files from among 
said plurality of electronic text files having said first 
search term, 

(ii) performing a key search of the plurality of elec- 
tronic files and generating a second search term hit 
list associated with each of the N second search 
terms, the second search term hit list identifying, for
each of the second search terms, the electronic text 
files from among said plurality of electronic text files 
having said second search term, and 

(iii) identifying MxN unique search term pairs, each 
search term pair representing one first search term 
from each of the plurality of M first search terms and 
one second search term from the plurality of N 
second search terms; and 

(iv) generating a plurality of MxN conjunction hit lists, 
each associated with each key word pair, each con- 
junction hit list identifying the electronic files from 
among the plurality of electronic text files having 

both the first search term and the second search term
of the key word pair, and identifying said electronic 
files' population count. 

2. A method according to claim 1 further comprising: 
receiving a user-defined search term definition data, said 

definition data including a search term name and a 
Boolean expression of language words associated with 
said search term name; 

displaying said search term name; and 

receiving a user-entered command selecting said displayed
search term and, in response, assigning the Boolean 
expression associated with the said search term as one 
of said first search terms. 

3. A method according to claim 1 further comprising steps of:

receiving a search result format command from the user; 
receiving an outline command and, in response generating
a visual display of a report outline, the report outline
having information formatted for comparison with the
information field of one or more of said collection
documents; 

receiving a report generation command from the user and, 
in response, generating a report document based on a 
plurality of said collection documents and formatted in 
accordance with the report outline. 

4. A method for searching electronic files according to 
claim 1, further comprising: 

displaying at least a sub-plurality of said M first search 
terms along a first region of a display screen, said first 
region extending in a first direction along said display 
screen; 

displaying at least a sub-plurality of said N second search 
terms along a second region of a display screen, said 
second region extending in a second direction along 
said display screen, 

displaying, within each of a plurality of regions on said 
display screen, an indicia representing a corresponding 
one of said plurality of conjunction hit lists, each region 
aligned along said first direction with one of said at 
least a sub-plurality of said M first search terms and 
aligned along said second direction with one of said at 
least a sub-plurality of said N second search terms, and 
the indicia within said region representing the conjunc- 
tion hit list associated with the said one of said M first 
search terms and said one of said N second search 
terms. 

5. A method for searching electronic files according to 
claim 4, further comprising: 

receiving a user-entered command identifying one from 
among said regions; 

displaying a representation of at least a sub-plurality of 
the population of files associated with the conjunction 
hit list represented by said indicia within said region. 

6. A method for searching electronic files according to 
claim 5, further comprising: 

receiving a user-entered file-selection command identify- 
ing one from among said population of files associated 
with the conjunction hit list represented by said indicia 
within said region, 

displaying a portion of said file, said portion showing the 
first search term and the second search term of the key 
word pair defined by said first search term and said 
second search term. 

7. A method for searching electronic files according to 
claim 6, further comprising: 

receiving a user-entered collection document creation
command and, in response, creating a collection file 
and displaying a content field for entering data into said 
collection file; 
receiving a user-entered text select command and, in 
response, copying said portion of said file identified by 
said user-entered file selection command into said 
content field, and inserting a file selection hyperlink 
into said content field to be visibly associated with said 
file from which said portion was copied. 

8. A search method comprising: 
providing a plurality of electronic text files; 
providing a human interface apparatus for receiving 

search terms and commands from a user, and for 
accessing and searching said plurality of files in accor- 
dance with said data and commands; 
receiving, at the human interface apparatus, a first search
term;
receiving, at the human interface apparatus, a second
search term;

receiving, at the human interface apparatus, a third search 
term; 

receiving, at the human interface apparatus, a fourth
search term; 

receiving, at the human interface apparatus, a search 
command; 

searching, in response to the search command, the plu-
rality of electronic files and generating a first list, a
second list, a third list, a fourth list and a fifth list, said
first list identifying each of said electronic files having
both the first and third search terms, said second list
identifying each of said electronic files having both the
first and fourth search terms, a third list, said third list
identifying each of said electronic files having both the
second and third search terms, and a fourth list, said
fourth list identifying each of said electronic files
having both the second and fourth search terms;

displaying, on a video display, said first search term, said
second search term, said third search term and said
fourth search term, a first list identifier representing at
least a population count of electronic files within said 
first list, a second list identifier representing at least a 
population count of electronic files within said second 
list, a third list identifier representing at least a popu- 
lation count of electronic files within said third list, and 
a fourth list identifier representing at least a population 
count of electronic files within said fourth list. 
9. A method according to claim 8, wherein 
said displaying is performed such that said first list 
identifier is positioned and aligned according to said 
displayed first and third search terms, said second list 
identifier is positioned and aligned according to said 
displayed first and fourth search terms, said third list 
identifier is positioned and aligned according to said 
displayed second and third search terms, and said 
fourth list identifier is positioned and aligned according 
to said displayed second and fourth search terms. 
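
The hit lists and population counts recited in the claims above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: files are held in memory as a hypothetical `dict` of name to text, a "key search" is reduced to a case-insensitive substring test, and the function name `matrix_search` is invented for the example.

```python
def matrix_search(files, first_terms, second_terms):
    """Return per-term hit lists, M x N conjunction hit lists, and their counts.

    files: dict mapping a file name to its text (an in-memory stand-in for
    the plurality of electronic text files).
    """

    def hit_list(term):
        # Simplified key search: case-insensitive substring match.
        needle = term.lower()
        return {name for name, text in files.items() if needle in text.lower()}

    first_hits = {t: hit_list(t) for t in first_terms}    # M hit lists
    second_hits = {t: hit_list(t) for t in second_terms}  # N hit lists

    # One conjunction hit list per (first term, second term) pair: files
    # that contain both terms, obtained by intersecting the two hit lists.
    pair_hits = {
        (a, b): first_hits[a] & second_hits[b]
        for a in first_terms
        for b in second_terms
    }
    counts = {pair: len(hits) for pair, hits in pair_hits.items()}
    return first_hits, second_hits, pair_hits, counts


if __name__ == "__main__":
    docs = {
        "a.txt": "Heart treatment options",
        "b.txt": "Heart surgery",
        "c.txt": "Lung treatment",
    }
    _, _, pairs, counts = matrix_search(docs, ["heart", "lung"], ["treatment", "surgery"])
    for pair in sorted(counts):
        print(pair, counts[pair], sorted(pairs[pair]))
```

Running the search again with `files` restricted to the names in one conjunction hit list gives the "drilldown" behaviour described for search step 104.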

*****

ABSTRACT



A phrase recognition method breaks streams of text into text 
"chunks" and selects certain chunks as "phrases" useful for 
automated full text searching. The phrase recognition 
method uses a carefully assembled list of partition elements 
to partition the text into the chunks, and selects phrases from 
the chunks according to a small number of frequency based 
definitions. The method can also incorporate additional 
processes such as categorization of proper names to enhance 
phrase recognition. The method selects phrases quickly and 
efficiently, referring simply to the phrases themselves and 
the frequency with which they are encountered, rather than 
relying on complex, time-consuming, resource-consuming 
grammatical analysis, or on collocation schemes of limited 
applicability, or on heuristical text analysis of limited reli- 
ability or utility. 
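
The partition-and-count idea in the abstract can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the patented method: the `PARTITION_WORDS` set is a tiny hypothetical stand-in for the carefully assembled partition list, all text is lower-cased (so the separate handling of upper case words and proper names is ignored), and only one frequency rule is applied, namely that a multi-word chunk must occur at least `min_freq` times.

```python
import re
from collections import Counter

# Hypothetical, very small partition list; the real list is far larger.
PARTITION_WORDS = {"the", "a", "an", "of", "in", "is", "are", "and", "for", "to", "on"}


def chunk(text):
    """Split text into chunks at partition words and at punctuation."""
    chunks, current = [], []
    # Tokens are either runs of letters or single non-letter, non-space characters.
    for token in re.findall(r"[A-Za-z]+|[^\sA-Za-z]", text):
        if token.isalpha() and token.lower() not in PARTITION_WORDS:
            current.append(token.lower())
        else:
            # A partition word or punctuation mark closes the current chunk.
            if current:
                chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks


def select_phrases(text, min_freq=2):
    """Keep multi-word chunks that occur at least min_freq times."""
    counts = Counter(c for c in chunk(text) if " " in c)
    return {phrase: n for phrase, n in counts.items() if n >= min_freq}


if __name__ == "__main__":
    sample = ("Product liability is a growing concern. "
              "The law of product liability is complex.")
    print(chunk(sample))
    print(select_phrases(sample))
```

No grammatical analysis or part-of-speech tagging is involved; the only inputs are the partition list and the chunk frequencies.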

30 Claims, 6 Drawing Sheets 




05/25/2004, EAST Version: 1.4.1 



5,819,260 

Page 2 



OTHER PUBLICATIONS 

Ahlswede, Thomas, et al., "Automatic Construction of a
Phrasal Thesaurus for an Information Retrieval System from
a Machine Readable Dictionary", Proceedings of RIAO '88,
Cambridge, Massachusetts, Mar. 1988, pp. 597-608.

Church, Kenneth Ward, et al. (Bell Laboratories), "A Sto-
chastic Parts Program and Noun Phrases Parser for Unre-
stricted Text", Proceedings of 1989 International Confer-
ence on Acoustics, Speech and Signal Processing (IEEE Cat.
No. 89CH2673-2), Glasgow, Scotland, UK, May 1989, pp.
695-698.






U.S. Patent Oct. 6, 1998 Sheet 1 of 6 5,819,260 








U.S. Patent Oct. 6, 1998 Sheet 2 of 6 



FIG. 3 (flowchart): step 10, PARTITIONING TEXT INTO CHUNKS USING PARTITION LIST (FIG. 4A); step 20, SELECTING PHRASE BASED ON FREQUENCY (FIG. 4B); step 30, OPTIONAL PROCESSING (FIGS. 6A-6D).






U.S. Patent Oct. 6, 1998 Sheet 3 of 6 5,819,260 



FIG. 4A (flowchart of step 10): 101, LOOK UP TEXT IN PARTITION WORK LIST; 102, SUBSTITUTING THE MATCHED WITH THE PARTITION TAG; 103, ADDING ADDITIONAL TAG; 104, GENERATING A TEXT CHUNK LIST; 105, SCANNING & REGENERATING THE LIST WITH CHUNK FREQ.

FIG. 4B (flowchart of step 20): 201, PROCESSING LOW CASE WORD; 202, PROCESSING LOW CASE PHRASE; 203, STORING UPPER CASE WORK; 204, PROCESSING PROPER NAMES (> 1 WORD); 205, DISCARDING OTHER TEXT ITEM; 206, MAPPING LOW CASE SUB-PHRASE TO ITS PHRASE; 207, MAPPING SINGLE UPPER CASE WORD TO ITS PROPER NAME; 208, MAPPING ACRONYM TO ITS PHRASE; 209, COMBINING PHRASE LIST.






U.S. Patent Oct. 6, 1998 Sheet 4 of 6 5,819,260

FIG. 5 (block diagram): a PROCESS block holding the PHRASE RECOGNITION PROCESS (PARTITION 10, PHRASE SELECTION 20, OPTIONAL PROCESSING 30), and a MEMORY block holding TEXT STREAM(S); PARTITION LIST; CHUNK LIST; LOWER CASE SINGLE-WORDS (ANY FREQUENCY); UPPER CASE SINGLE-WORDS (ANY FREQUENCY); PROPER NAME (> 1 WORD) LIST (ANY FREQUENCY); LOWER CASE PHRASES (> 1 WORD) (FREQUENCY > 1); LOWER CASE PHRASES (> 1 WORD) (FREQUENCY = 1); ACRONYMS; SYNONYM THESAURUS; PHRASE FREQUENCY IN COLLECTION OF DOCUMENTS (FREQUENCY IN COLLECTION > 5); SPECIAL INDICATORS/NAMES; SPECIAL NAMES.






U.S. Patent Oct. 6, 1998 Sheet 5 of 6 5,819,260 



FIG. 6A: 301, CONSOLIDATING PHRASE USING SYNONYM THESAURUS.

FIG. 6B: 302, PARTITIONING ON PREPOSITION; 303, MAPPING SUB-PHRASE TO ITS PHRASE.

FIG. 6C: TRIMMING PHRASE.

FIG. 6D: 305, CATEGORIZING: COMPANY; 306, CATEGORIZING: GEO. NAME; 307, CATEGORIZING: PRODUCT; 308, CATEGORIZING: ORGANIZATION; 309, CATEGORIZING: PEOPLE.






U.S. Patent Oct. 6, 1998 Sheet 6 of 6 5,819,260

FIGS. 7A, 7B and 7C (flowcharts). Box labels: BEGIN OLS FLOW; DOCUMENT BROWSE, ENTER MORE COMMAND; RETRIEVE DOCUMENT; PHRASE EXTRACTION; DISPLAY ERROR MESSAGE; VALIDATE CANDIDATE PHRASES; ADD NEW TERMS AND PHRASES TO EXISTING SEARCH TERM; DISPLAY NEW SEARCH OPTIONS; BEGIN DISTRIBUTED DEVELOPMENT SYSTEM FLOW; DOCUMENT FILTER PER CONTROL FILE; PHRASE EXTRACTION; SORT AND CALCULATE PHRASE FREQUENCIES; DISCARD; INCLUDE IN REVISED LIST; TRANSFER & BUILD PHRASE DICTIONARY; BEGIN BATCH MAINFRAME FLOW; FILTER DOCUMENTS PER CONTROL FILE; PHRASE EXTRACTION.






5,819,260 

PHRASE RECOGNITION METHOD AND APPARATUS

COPYRIGHT NOTICE: A portion of the disclosure (including all Lists) of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the U.S. Patent and Trademark Office patent file or records, but the copyright owner reserves all other copyright rights whatsoever.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to automated indexing of full-text documents to identify the content-bearing terms for later document retrieval. More specifically, the invention relates to computer-automated identification of phrases which are useful in representing the conceptual content of documents for indexing and retrieval.

2. Related Art

A type of content-bearing term is the "phrase", a language device used in information retrieval to improve retrieval precision. For example, the phrase "product liability" indicates a concept that neither of the two component words can fully express. Without this phrase, a retrieval process is unable to find the documents in which the concept is discussed.

In traditional Boolean retrieval systems, phrase recognition is not an issue. The systems are known as post-coordination indexing systems in that phrases can be discovered through examining the adjacency relationships among search words during the process of merging inverted lists associated with the words.

However, in modern information retrieval systems, the statistical distribution characteristics of index terms are crucial to the relevance ranking process, and it is desirable to recognize phrases and derive their statistical characteristics in advance. In addition, in fabricating hypertext databases, recognized phrases are necessary for hypertext linkage.

Known phrase recognition methods include three types: machine translation, statistical text analysis and heuristical text analysis.

First, machine translation's approach to recognizing phrases (known as compound structures) is to analyze part-of-speech tags associated with the words in an input text string, usually a sentence. Noun phrases and verb phrases are two examples of such phrases. Syntactical context and lexical relationships among the words are key factors that determine successful parsing of the text. In machine translation, the goal is not of finding correct phrases, but of discovering the correct syntactical structure of the input text string to support other translation tasks. It is infeasible to use this syntactical parsing method for processing commercial full-text databases; the method is inefficient and is in practical terms not scalable. Regarding machine translation, reference may be made to U.S. Pat. Nos. 5,299,124, 5,289,375, 4,994,966, 4,914,590, 4,931,936, and 4,864,502.

The second method of analysis, statistical text analysis, has two goals: disambiguating part of speech tagging, and discovering noun phrases or other compound terms. The [...] of-speech tag, or a given pair of words tends to appear together in a given data collection. When a word has more than one part-of-speech tag associated with it in a dictionary, consulting the part of speech of the next word and calculating the probability of occurrence of the two tags would help select a tag. Similarly, a pair of words that often appear together in the collection is probably a phrase. However, statistical text analysis requires knowledge of collocation that can only be derived from a known data collection. Disadvantageously, the method is not suitable for processing unknown data. Regarding statistical text analysis, reference may be made to U.S. Pat. Nos. 5,225,981, 5,146,405, and 4,868,750.

The third method of analysis, heuristical text analysis, [...] such as company names, peoples names, or product names. For example, a list of capital words followed by a company indicator like "Limited" or "Corp" is an example pattern for recognizing company names in text. The heuristical text analysis method requires strong observation ability from a human analyst. Due to the limitation of humans' observation span, heuristical text analysis is only feasible for small subject domains (e.g., company name, product names, case document names, addresses, etc.). Regarding heuristical text analysis, reference may be made to U.S. Pat. Nos. 5,410,475, 5,287,278, 5,251,316, and 5,161,105.

Thus, machine translation methods, involving potentially complex grammatical analysis, are too expensive and too error-prone for phrase recognition. Statistical text analysis, being based on collocation and being purely based on statistics, is still expensive because of the required full scale of part-of-speech tagging and pre-calculating collocation information, and also has difficulties processing unknown data without the collocation knowledge. Finally, heuristical text analysis, relying on "signal terms", is highly domain-dependent and has trouble processing general texts.

Thus, there is a need in the art for a simple, time-efficient, resource-efficient, and reliable phrase recognition method for use in assisting text indexing or for forming a statistical thesaurus. It is desired that the phrase recognition method be applicable to both individual documents and to large collections of documents, so that the performance of not only real-time on-line systems, but also distributed and mainframe text search systems can be improved. It is also desired that the phrase recognition method have engineering scalability, and not be limited to particular domains of knowledge.

The present invention is directed to fulfilling these needs.

SUMMARY OF THE INVENTION

The present invention provides a phrase recognition method which breaks text into text "chunks" and selects certain chunks as "phrases" useful for automated full text searching. The invention uses a carefully assembled list of partition words to partition the text into the chunks, and selects phrases from the chunks according to a small number of frequency-based definitions. The invention can also incorporate additional processes such as categorization of proper names to enhance phrase recognition.

The invention achieves its goals quickly and efficiently, referring simply to the phrases and the frequency with which they are encountered, rather than relying on complex, time-consuming, resource-consuming grammatical analysis, or
statistics used include collocation information or mutual on collocation schemes of limited applicability, or on heu- 
information, i.e., the probability that a given pair of part- ristical text analysis of limited reliability or utility. 
Additional objects, features and advantages of the
invention will become apparent when the following Detailed
Description of the Preferred Embodiments is read in
conjunction with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention is better understood by reading the following
Detailed Description of the Preferred Embodiments with
reference to the accompanying drawing figures, in which
like reference numerals refer to like elements throughout,
and in which:

FIG. 1 illustrates an exemplary hardware configuration on
which the inventive phrase recognition method may be
executed.

FIG. 2 illustrates another exemplary hardware environment
in which the inventive phrase recognition method may
be practiced.

FIG. 3 is a high level flow chart schematically indicating
execution in an embodiment of the phrase recognition
method according to the present invention.

FIG. 4A is a flow chart schematically indicating execution
in a module for partitioning text and generating text chunks.

FIG. 4B is a flow chart indicating execution of a module
for selecting phrases using the data memory structure
diagram of FIG. 5.

FIG. 5 is a data memory structure diagram schematically
illustrating data flow during the inventive phrase recognition
method (FIGS. 3, 4A, 4B) and corresponding memory
allocation for various types of data used in accordance with
the process.

FIG. 6A is a flow chart of an optional processing module
for consolidating with a thesaurus.

FIG. 6B is a flow chart of an optional processing module
for processing phrases with prepositions.

FIG. 6C is a flow chart of an optional processing module
for trimming phrases with their collection frequencies.

FIG. 6D is a flow chart of an optional processing module
for categorizing proper names.

FIGS. 7A, 7B and 7C illustrate exemplary applications of
the inventive phrase recognition method according to the
present invention. In particular: FIG. 7A indicates a user's
viewing of a document in accordance with a suitable on-line
text search system, and invoking the inventive phrase
recognition method to search for additional documents of
similar conceptual content; FIG. 7B schematically illustrates
implementation of the phrase recognition method in a batch
phrase recognition system in a distributed development
system; FIG. 7C schematically illustrates application of the
inventive phrase recognition method in a batch process in a
mainframe system.

DETAILED DESCRIPTION OF THE
PREFERRED EMBODIMENTS

In describing preferred embodiments of the present
invention illustrated in the drawings, specific terminology is
employed for the sake of clarity. However, the invention is
not intended to be limited to the specific terminology so
selected, and it is to be understood that each specific element
includes all technical equivalents which operate in a similar
manner to accomplish a similar purpose.

The concept of the present invention is first described on
a particular example of a text stream. Then, block diagrams
and flow charts are described, which illustrate non-limiting
embodiments of the invention's structure and function.

Very briefly, a preferred embodiment of the inventive
method partitions an input text stream based on punctuation
and vocabulary. As the method processes the text stream
sequentially, it inserts partition symbols between words if
certain punctuation exists, such as a comma, end of
sentence, or change in capitalization. Further, each word
encountered is checked against one or more vocabulary lists,
and may be discarded and replaced by a partition symbol,
based on the word and where it is encountered.

After the document is thus processed, a set of candidate
terms and "phrases" (a series of non-partitioned words) is
produced. In the preferred embodiment, solitary words
(individual words immediately surrounded by partitions) are
ignored at this point. The phrases are processed to determine
which phrases occur with higher frequency. Preferably,
shorter phrases which occur with higher frequency and as
subsets of lower-frequency but lengthier phrases are also
sought. A set of phrases meeting or exceeding a given
threshold frequency is produced.

The inventive method is more easily understood with
reference to a particular example.

As mentioned above, members of a list of words
(including punctuation) serve as "break points" to form text
"chunks" within input text. A first (rudimentary) list includes
words that can be used as "stop words". The stop words usually
carry little semantic information because they exist merely
for various language functions. This list has a few hundred
members and includes articles (e.g., "a", "the"),
conjunctions (e.g., "and", "or"), adverbs (e.g., "where", "why"),
prepositions (e.g., "of", "to", "for"), pronouns (e.g., "we",
"his"), and perhaps some numeric items.

However, this first list is too short for the present phrase
recognition method because it causes generation of a list of
text chunks that are too long to allow efficient generation of
desirable phrases. Additional stop words or other partition
items are needed for reducing the size of the text chunks, so
that more desirable phrases may be found.

The following example of text illustrates this problem. In
this example, the text "chunks" are within square brackets,
with the text chunks being separated by the members of the
list of stop words (break points):

[Citing] what is [called newly conciliatory comments] by
the [leader] of the [Irish Republican Army]'s [political
wing], the [Clinton Administration announced today]
that it would [issue] him a [visa] to [attend] a
[conference] on [Northern Ireland] in [Manhattan] on
[Tuesday]. The [Administration] had been [leaning]
against [issuing] the [visa] to the [official], [Gerry
Adams], the [head] of [Sinn Fein], [leaving] the [White
House caught] between the [British Government] and a
[powerful bloc] of [Irish-American legislators] who
[favored] the [visa]. (Parsed text based on rudimentary
list)

Since desirable phrases include noun phrases (e.g., "ice
cream"), adjective-noun phrases (e.g., "high school"),
participle-noun phrases (e.g., "operating system"), and
proper names (e.g., "White House"), it is safe to add adverbs
(e.g., "fully") and non-participle verbs (e.g., "have", "is",
"obtain") to the list of stop words to form an enhanced stop
word list. This enhanced stop word list allows the method to
provide smaller text chunks, yet is still compact enough for
efficient look-up by computer. With the enhanced list, the
above example text is parsed into chunks and stop words as
follows:

[Citing] what is [called] newly [conciliatory comments]
by the [leader] of the [Irish Republican Army]'s
[political wing], the [Clinton Administration]
announced [today] that it would issue him a [visa] to
attend a [conference] on [Northern Ireland] in
[Manhattan] on [Tuesday]. The [Administration] had
been [leaning] against [issuing] the [visa] to the
[official], [Gerry Adams], the [head] of [Sinn Fein],
[leaving] the [White House] caught between the
[British Government] and a [powerful bloc] of
[Irish-American legislators] who favored the [visa]. (Second
parsed text based on enhanced list)
The theoretical justification of using this enhanced list 
derives from two sources. 

A first justification is that this list only represents about 
13% of unique words in a general English dictionary. For 
example, in the Moby dictionary of 214,100 entries, there 
are 28,408 words that can be put into the list. This fact 
ensures that semantic information in texts is maintained at
a maximum level. 
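
As a quick check of the figure quoted above, the ratio can be recomputed from the two counts given for the Moby dictionary. The variable names are illustrative only and do not come from the described embodiment.

```python
# Counts quoted in the text for the Moby dictionary.
total_entries = 214_100    # entries in the dictionary
partition_words = 28_408   # words that can be put into the enhanced list

# Share of the dictionary that the enhanced list would occupy.
share = partition_words / total_entries
print(f"{share:.1%}")
```

The result rounds to roughly 13%, consistent with the "about 13%" stated above.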

A second justification involves the lexical characteristics 
of these words. Most of the words bear little content. This 
second fact reduces the risk of losing semantic information 
in the text. 

The basic concept of the invention having been described, 
particular implementations of its structure and function are 
now presented. 

As will readily be appreciated, the invention is preferably 
embodied as software, instruction codes capable of being 
executed by digital computers, including commercially 
available general purpose digital computers well known to 
those skilled in the art. The particular hardware on which the 
invention may be implemented varies with the particular 
desired application of the inventive phrase recognition 
method. Three examples of such applications of the
phrase recognition method are described in greater detail
below, with reference to FIGS. 7A, 7B, and 7C. Briefly, the
dynamic recognition method involved in an on-line system
(FIG. 7A) may be implemented in IBM 370 assembly
language code. Alternatively, in a batch recognition system 
in a distributed development system (FIG. 7B), the phrase 
recognition method may be implemented on a SUN work 
station using the PERL script interpretive prototyping lan- 
guage. As a still further implementation, the inventive 
phrase recognition method may be implemented on an 
Amdahl AMD 5995-1400-a mainframe so that another batch 
phrase recognition system (FIG. 7C) may be realized. Of 
course, the scope of the invention should not be limited by 
these exemplary embodiments or applications. 

Embodiments of the inventive phrase recognition method 
may be implemented as a software program including a
series of executable modules on a computer system. As 
shown in FIG. 1, an exemplary hardware platform includes 
a central processing unit 110. The central processing unit 
interacts with a human user through a user interface 112. The 
user interface is used for inputting information into the 
system and for interaction between the system and the 
human user. The user interface 112 includes, for example, a 
video display 113 and a keyboard 115. A computer memory 
114 provides storage for data and software programs which 
are executed by the central processing unit 110. Auxiliary 
memory 116, such as a hard disk drive or a tape drive, 
provides additional storage capacity and a means for retriev- 
ing large batches of information. 

All components shown in FIG. 1 are of a type well known 
in the art. For example, the FIG. 1 system may include a 
SUN® work station including the execution platform Sparc 
2 and SUN OS Version 4.1.2., available from SUN MICRO- 
SYSTEMS of Sunnyvale, Calif. Of course, the system of the 
present invention may be implemented on any number of
modern computer systems. 

A second, more complex environment in which the inventive
phrase recognition method may be practiced is shown in
FIG. 2. In particular, a document search and retrieval system
30 is shown. The system allows a user to search a subset of
a plurality of documents for particular key words or phrases.
The system then allows the user to view documents that
match the search request. The system 30 comprises a
plurality of Search and Retrieval (SR) computers 32-35
connected via a high speed interconnection 38 to a plurality of
Session Administrator (SA) computers 42-44.

Each of the SR's 32-35 is connected to one or more
document collections 46-49, each containing text for a
plurality of documents, indexes therefor, and other ancillary
data. More than one SR can access a single document
collection. Also, a single SR can be provided access to more
than one document collection. The SR's 32-35 can be
implemented using a variety of commercially available
computers well known in the art, such as Model EX100
manufactured by Hitachi Data Systems of Santa Clara, Calif.

Each of the SA's 42-44 is provided access to data
representing phrase and thesaurus dictionaries 52-54. The
SA's 42-44 can also be implemented using a variety of
commercially available computers, such as Models 5990
and 5995 manufactured by Amdahl Corporation of
Sunnyvale, Calif. The interconnection 38 between the SR's and the
SA's can be any one of a number of two-way high-speed
computer data interconnections well known in the art, such
as the Model 7200-DX manufactured by Network Systems
Corporation of Minneapolis, Minn.

Each of the SA's 42-44 is connected to one of a plurality
of front end processors 56-58. The front end processors
56-58 provide a connection of the system 30 to one or more
commonly available networks 62 for accessing digital data,
such as an X.25 network, long distance telephone lines,
and/or SprintNet. Connected to the network 62 are plural
terminals 64-66 which provide users access to the system
30. Terminals 64-66 can be dumb terminals which simply
process and display data inputs and outputs, or they can be
one of a variety of readily available stand-alone computers,
such as IBM or IBM-compatible personal computers. The
front end processors 56-58 can be implemented by a variety
of commercially available devices, such as Models 4745 and
4705 manufactured by the Amdahl Corporation of
Sunnyvale, Calif.

The number of components shown in FIG. 2 is for
illustrative purposes only. The system 30 described herein
can have any number of SA's, SR's, front end processors,
etc. Also, the distribution of processing described herein
may be modified and may in fact be performed on a single
computer without departing from the spirit and scope of the
invention.

A user wishing to access the system 30 via one of the
terminals 64-66 will use the network 62 to establish a
connection, by means well known in the art, to one of the
front end processors 56-58. The front end processors 56-58
handle communication with the user terminals 64-66 by
providing output data for display by the terminals 64-66 and
by processing terminal keyboard inputs entered by the user.
The data output by the front end processors 56-58 includes
text and screen commands. The front end processors 56-58
support screen control commands, such as the commonly
known VT100 commands, which provide screen functionality
to the terminals 64-66 such as clearing the screen and
moving the cursor insertion point. The front end processors
56-58 can handle other known types of terminals and/or
stand-alone computers by providing appropriate commands.
Each of the front end processors 56-58 communicates
bidirectionally, by means well known in the art, with its 
corresponding one of the SA's 42-44. It is also possible to 
configure the system, in a manner well known in the art, 
such that one or more of the front end processors can 
communicate with more than one of the SA's 42-44. The 
front end processors 56-58 can be configured to "load 
balance" the SA's 42-44 in response to data flow patterns. 
The concept of load balancing is well known in the art. 

Each of the SA's 42-44 contains an application program 
that processes search requests input by a user at one of the 
terminals 64-66 and passes the search request information 
onto one or more of the SR's 32-35 which perform the 
search and returns the results, including the text of the 
documents, to the SA's 42-44. The SA's 42-44 provide the 
user with text documents corresponding to the search results 
via the terminals 64-66. For a particular user session (i.e., a
single user accessing the system via one of the terminals 
64-66), a single one of the SA's 42-44 will interact with a 
user through an appropriate one of the front end processors 
56-58. 

Preferably, the inventive phrase recognition method is 
implemented in the session administrator SA computers 
42-44, with primary memory being in the SA computer 
itself and further memory being illustrated within elements 
52-54. 

The principles on which the inventive method is based, 
and hardware systems and software platforms on which it 
may be executed, having been described, a preferred 
embodiment of the inventive phrase recognition method is 
described as follows. 

FIG. 3 is a high level flow diagram of the phrase recog- 
nition method of the preferred embodiment. 

Referring to FIG. 3, the invention uses a carefully 
assembled list of English words (and other considerations 
such as punctuation) in a Partition Word List (more gener- 
ally referred to as a Partition Entity List, or simply Partition 
List) to partition one or more input text streams into many 
text chunks. This partitioning process is illustrated in block 
10. 

Block 20 indicates selection of phrases from among the 
chunks of text, according to frequency based definitions. A 
Phrase list, including the selected phrases, results from 
execution of the process in block 20. During the phrase 
selection process, solitary words (single-word chunks), as 
well as words from the decomposed phrases, can be main- 
tained separate from the Phrase List as optional outputs for 
other indexing activities. 

Details of processes 10 and 20 are described with refer- 
ence to FIGS. 4A and 4B. 

The invention can optionally incorporate one or more
other processes, generically indicated as element 30. Such
optional processes may include categorization (examples
described with reference to FIGS. 6A-6D) to enhance the
list of recognized phrases.

FIG. 4A is a flow chart of FIG. 3 module 10, for 
partitioning text and generating text chunks. 

FIG. 4A shows how the method, given a piece of text, 
partitions the text into many small text chunks. A critical 
component in this method is the Partition List (including 
words and punctuation) whose members serve as break 
points to generate the text chunks. 

As mentioned above, a Partition List ideally allows pars- 
ing of text into short phrases, but is itself still compact 
enough for efficient computer look-up during the parsing 
process. Preferably, the Partition List is generated using not 
only articles, conjunctions, adverbs, prepositions, pronouns, 
and numeric items, but also adverbs and verbs, to form an 
enhanced list. 
The text partitioning process starts off with looking up
encountered text in the Partition List (at block 101) and
replacing every matched partition word or other partition
entity with a partition tag such as "####" (shown at block
102).

Additional partition tags are added into those text chunks
at the point where there is a case change, either from lower
case to upper case or vice versa (shown at block 103). Block
104 indicates generation of the text chunk list which
preserves the natural sequence of the chunks as encountered in
the text.
The frequency information for each chunk in the list is
collected by scanning the text chunks in their natural
sequence. The first occurrence of each unique chunk in the
sequence is registered as a new entry with its frequency as
1. Subsequent occurrences are registered by incrementing its
frequency count by 1. This generation of occurrence
frequencies in association with the respective chunks is
indicated by block 105.
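
The counting step of block 105 might be sketched as below. The function name `count_chunks` is hypothetical, and the sketch relies on the fact that a Python dictionary keeps keys in insertion order, so the natural sequence of first occurrences is preserved.

```python
def count_chunks(chunks):
    """Map each unique chunk to its frequency, in first-seen order."""
    freq = {}
    for chunk in chunks:
        if chunk not in freq:
            freq[chunk] = 1    # first occurrence: new entry, frequency 1
        else:
            freq[chunk] += 1   # later occurrence: increment the count
    return freq

# Illustrative input only.
sample_freq = count_chunks(["visa", "Administration", "visa"])
print(sample_freq)
```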

FIG. 4B is a flow chart of FIG. 3 module 20, illustrating
details of a preferred process for selecting which chunks are
phrases.

FIG. 5 is a data memory structure diagram showing how
data may be arranged in memory for the process, and how
data flows into and out of various steps of the process. More
specifically, the steps from FIG. 3 of text partitioning 10,
phrase selection 20, and optional processing 30 (reproduced
on the left side of FIG. 5) are illustrated in conjunction with
an exemplary data memory structure diagram (on the right
side of FIG. 5) to schematically illustrate data flow between
major functional procedures and data structures. The various
lists shown in the exemplary memory blocks on the right
side of FIG. 5 are understood to include list members in
conjunction with their respective frequencies of occurrence.

The memory (broadly, any data storage medium such as
RAM and/or magnetic disk and/or optical disk and/or other
suitable computer readable medium) may be structured in
memory blocks as schematically illustrated. A text stream
file 300 and a Partition List 310 are used as inputs to the
partitioning process 10 of the inventive phrase recognition
method. The partitioning process 10 provides a chunk list
(understood as including corresponding chunk frequencies)
315. Chunk list 315 is used by the phrase selection process
20 of the inventive phrase recognition method.

The partitioning process produces various groupings of
chunks, each with their respective frequencies of occurrence
within the text stream. These groupings of chunks are
illustrated on the right side of FIG. 5, with the understanding
that the invention should not be limited to the particular
memory structure so illustrated.

Specifically, lower case words (that is, single-word
chunks) are in memory block 320, capitalized or "allcaps"
single-word chunks are in memory block 325, a Proper
Name List (preferably of greater than one word, each being
capitalized or in allcaps) is in memory block 330, lower case
phrases of greater than one word occurring more than once
are in memory block 335, lower case phrases of greater than
one word which were encountered only once are in memory
block 345, and, optionally, acronyms are in memory block
350.
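
The grouping just described can be sketched as a simple classification over a chunk-frequency table. This is only an illustration: the function name `classify_chunks`, the use of a dictionary keyed by memory block number, and the `threshold` parameter (defaulting to 2 occurrences) are assumptions made for the example, and the optional acronym block 350 is left out.

```python
def classify_chunks(freq, threshold=2):
    """Sort chunk frequencies into the groupings named in the text."""
    blocks = {320: {}, 325: {}, 330: {}, 335: {}, 345: {}}
    for chunk, count in freq.items():
        words = chunk.split()
        if not words:
            continue
        upper = [w[0].isupper() for w in words]
        if len(words) == 1:
            # Single-word chunks: upper case in 325, lower case in 320.
            blocks[325 if upper[0] else 320][chunk] = count
        elif all(upper):
            blocks[330][chunk] = count   # Proper Name List
        elif not any(upper):
            if count >= threshold:
                blocks[335][chunk] = count   # occurring more than once
            else:
                blocks[345][chunk] = count   # encountered only once
        # Mixed-case multi-word chunks are left out in this sketch.
    return blocks

# Illustrative frequencies only.
sample_blocks = classify_chunks({
    "visa": 3,
    "Administration": 1,
    "Gerry Adams": 1,
    "political wing": 2,
    "powerful bloc": 1,
})
print(sample_blocks)
```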

A synonym thesaurus in memory block 375 may be used
in an optional process 30. A phrase frequency list derived
from a collection of plural documents in which the phrase
frequency throughout the collection is greater than a
threshold, in memory block 380, may also be used in an
optional processing procedure 30. Further, one or more
special indicator lists, generally indicated as 385A-385E
(company indicators, geographic names, product names,
organization indicators, English first names, respectively,
some of which are exemplified in the attached List) may
contribute to certain optional categorizing processes, and
result in corresponding name lists (company names,
geographic location names, product names, organization names,
and English names) generally indicated as 390A-390E.

Referring again to FIG. 4B, after the text chunk list is
produced, it is time to decide whether each chunk
in the list is a phrase useful for representing conceptual
content of documents. The inventive phrase recognition
method uses the frequency information of two types of the
partitioned text chunks (namely, the proper names in block
330 and the lower case phrases in blocks 335 and 345) to
make final phrase selection decisions. Preferably, the
invention focuses on lower case phrases of more than one word,
or on proper names ("John Hancock", "United States").

Referring to FIGS. 4B and 5, at block 201, entries
consisting of a solitary lower case word are not selected as
phrases. Rejected entries are stored in memory block 320.

As shown at block 202, those chunks that include plural
lower case words are determined to be phrases only if they
occur at least twice in the text stream. These chunks are
stored in memory block 335. Chunks not fitting these criteria
are stored in block 345 for further processing.

For chunks consisting of a solitary upper case word
(either the first letter being capitalized or "allcaps"), no
phrase decision is made at this stage, as shown at block 203.
Such chunks are stored in memory block 325.

In block 204, chunks including plural upper case words
are determined to be proper names and are stored in a Proper
Name List in memory block 330.

Finally, other text chunks not fitting the previous criteria
are simply discarded at this time, as indicated at block 205.

Next, block 206 examines the lower case phrases having
a single occurrence from memory block 345. Each is
examined for having one of its sub-phrases as part of an
existing lower case phrase in the list. For efficiency, a
sub-phrase may be defined to be the first or last two or three
words in the phrase. When the existence of a sub-phrase is
detected, it is merged into the corresponding phrase in the
list in memory block 335, and its frequency count is updated.
Otherwise, the lower case phrase is decomposed into
individual words for updating the lower case word list in
memory block 320 as an optional output.

As a result of this sub-phrase mapping in block 206, in our
example the list is reduced to a list of lower case phrases and
a list of proper names, both with their respective frequency
counts:

[political wing, 2]
[Citing, 1]
[Irish Republican Army, 1]
[Clinton Administration, 1]
[Northern Ireland, 1]
[Manhattan, 1]
[Tuesday, 1]
[Administration, 1]
[Gerry Adams, 1]
[Sinn Fein, 1]
[White House, 1]
[British Government, 1]

The singleton upper case word could be used for referencing
an existing proper name in the proper name list. To make the
final frequency count accurate, the method makes one
additional scan to the Proper Name List 330. It consolidates the
upper case word that is either the first or the last word of a
previously recognized proper name, and updates its
frequency count. This use of upper case single words in
memory block 325 to revise the Proper Name List 330 is
indicated at block 207. The method stores the other upper
case words in the upper case word list 325 as an optional
output.

A special case of the singleton upper case word is that of
the acronym. An acronym is defined either as a string of the
first character of each word (which is neither a preposition
nor a conjunction) in a proper name or as a string of the first
character of each word in a proper name followed by a
period. As indicated at block 208, when an acronym in
memory block 325 maps to a proper name in the proper
name list 330, the frequency count of the proper name is
incremented, and the pair of the proper name and its
acronym is copied into an acronym list 350 as an optional
output.

In our example, this reference checking process further
reduces the proper name list to the following:

[Irish Republican Army, 1]
[Clinton Administration, 2]
[Northern Ireland, 1]
[Gerry Adams, 1]
[Sinn Fein, 1]
[White House, 1]
[British Government, 1]

If no additional processing is necessary, this method
concludes by combining the lower case phrase list in memory
block 335 and the Proper Name List in memory block 330
into a single Phrase List 340 which is provided as the final
output of the phrase selection process 20.

In another embodiment, the lower case phrases with
frequency=1 in memory block 345 are also included in the
consolidation, in addition to the Proper Name List in
memory block 330 and the lower case phrases having
frequency greater than 1 in memory block 335. The choice
of either including or excluding the lower case phrases in
memory block 345 is determined by a frequency threshold
parameter which determines the number of times a lower
case phrase must be encountered before it is allowed to be
consolidated into the final Phrase List.

The example shown in FIG. 5 has this threshold set to 2,
so that those phrases encountered only once (in memory
block 345) are excluded from the consolidated Phrase List
340. The dotted line extending downward from Phrase List
340 to include memory block 345 shows how lower case
phrases encountered only once can be included in the Phrase
List if desired, however.

In any event, the consolidation of memory blocks into a
single Phrase List is indicated at block 209.

For this text stream example, the final Phrase List is as
follows:

[political wing, 2]
[Irish Republican Army, 1]
[Clinton Administration, 2]
[Northern Ireland, 1]
[Gerry Adams, 1]
[Sinn Fein, 1]
[White House, 1]
[British Government, 1]

The invention envisions that optional processes are available
for further enhancing the recognized phrases.

FIG. 6A is a flow chart of an optional processing module
for consolidating with a synonym thesaurus.

Referring to FIG. 6A, the Phrase List can be further
reduced with a synonym thesaurus, as indicated at block
301. The synonym thesaurus may be any suitable synonym
thesaurus available from commercial vendors. As an
example, the Phrase List may map "White House" to
"Clinton Administration." Using a synonym thesaurus is risky
because its contents may not reflect the intended conceptual
content of the text, and therefore may cause mapping errors.
For example, it would be problematic if a synonym
thesaurus maps "Bill Clinton" to "White House", because the two
terms are not always equivalent.

FIG. 6B is a flow chart of an optional processing module
for processing phrases with prepositions.

Referring to FIG. 6B, when a desirable lower case phrase
contains one of a small set of prepositions (e.g., "right to
counsel", "standard of proof"), the method takes the
preposition out of the Partition List used for generating text
chunks so that the phrase including the preposition has an
opportunity to reveal itself as being part of a "good" phrase.
This process is indicated as block 302.

Since it is statistically unlikely that any given occurrence
of a preposition is in a "good" phrase, this optional process
consumes substantial time for a relatively small increase in
phrases, and is considered optional.

It is necessary to have another process to further examine
the unqualified phrase in memory block 345 that contains
one of the selected prepositions, whether the sub-phrase on
the left of the preposition or the sub-phrase on the right
constitutes a valid phrase in the lower case phrase list in
memory block 335. This process is illustrated as block 303.
As a result of process blocks 302, 303, memory block 335
may be updated.

FIG. 6C is a flow chart of an optional processing module

the list that does not have the indicator word. If the search
is successful, the frequency count of the company name is
updated. The recognized company names are kept in a
Company Names list 390A as an optional output, as
indicated at block 305.

Similarly, a list 385B of geographic names or a list 385C
of product names may be used for looking up whether a
proper name has a match and thereafter for categorizing it
into the group of geographic names or the group of product
names, respectively. The recognized geographic names or
product names are kept in Geographic Location Names 390B
or Product Names 390C lists as optional outputs, as indicated
at blocks 306 and 307.

A list 385D of words that designate organizations is used
for determining whether the first or the last word of a proper
name is the indicator of organization, and thereafter for
categorizing it into the group of organizations. The
recognized organization names may be kept in an Organization
Names List 390D as an optional output, as indicated at block
308.

Finally, a list 385E of English first names is used for
determining whether the first word of a proper name is a
popular first name and thereafter for categorizing it into the
group of peoples' names. Any word before the first name is
removed from the proper name. The more comprehensive
the lists are, the more people names can be categorized
properly. The recognized people names are kept in a separate
English Names list 390E as an optional output for other
indexing activities, as indicated at block 309.

for trimming phrases with their collection frequencies. 30 Appendices A through E present an exemplary Partition 

Referring to FIG. 6C, still another optional process is that List 310 and exemplary Special Indicator/Name lists 

of editing the list of the Proper Name List 330 and lower 385A-385E. 

case phrases 335 with additional frequency information 380 The inventive method having been described above, the 

gathered from a text collection of more than one document. invention also encompasses apparatus (especially program- 

The assumption here is that, the more authors which use a 35 mable computers) for carrying out phrase recognition, 

phrase, the more reliable the phrase is for uniquely express- Further, the invention encompasses articles of manufacture, 

ing a concept. In other words, a phrase occurring in more specifically, computer readable memory on which the 

than one document is a "stronger" phrase than another computer-readable code embodying the method may be 

phrase occurring only in a single document stored, so that, when the code used in conjunction with a 

Here, the "collection frequency" of a phrase is the number 40 computer, the computer can carry out phrase recognition, 
of documents that contain the phrase. A collection frequency Non-limiting, illustrative examples of apparatus which 
threshold (e.g., five documents) can be set to trim down invention envisions are described above and illustrated in 
those phrases whose collection frequencies are below the FIGS. 1 and 2. Each constitutes a computer or other pro- 
threshold, as indicated at block 304. Essentially, FIG. 6C grammable apparatus whose actions are directed by a com- 
trims the entire Phrase List 340, including entries from either 45 puter program or other software. 

memory block 330 or 335. Non-limiting, illustrative articles of manufacture (storage 

When collection frequency information is available (as media with executable code) may include the disk memory 

illustrated by memory block 380), the minimum frequency 116 (FIG. 1), the disk memories 52-54 (FIG. 2), other 
_ requirement-oLtwo encounters-for-the lower-case-phrases- -magnetic disksr optical"disks, ^nvehtionar35 : inch~, X44~ 

within a text (see FIG. 5) can be lowered to one encounter. 50 MB "floppy" diskettes or other magnetic diskettes, magnetic 

"Mistaken" phrases will be rejected when consulting the tapes, and the like. Each constitutes a computer readable 

collection frequency information when considering multiple memory that can be used to direct the computer to function 

documents. in a particular manner when used by the computer. 

FIG. 6D is a flow chart of an optional processing module Those skilled in the art, given the preceding description of 

for categorizing proper names. 55 the inventive method, are readily capable of using knowl- 

Referring now to FIG. 6D, after proper names are iden- edge of hardware, of operating systems and software 

tified and are stored in the Proper Name List 330, it is platforms, of programming languages, and of storage media, 

possible to categorize them into new sets of pre-defined to make and use apparatus for phrase recognition, as well as 

groups, such as company names, geographic names, orga- computer readable memory articles of manufacture which, 

nization names, peoples 1 names, and product names. 60 when used in conjunction with a computer can carry out 

A list 385A of company indicators (e.g., "Co." and phrase recognition. Thus, the invention's scope includes not 

"Limited") is used for determining whether the last word in only the method itself, but apparatus and articles of manu- 

a proper name is such an indicator, and thereafter for facture. 

categorizing it into the group of company name. Any word Applications of the phrase recognition method. The 

after this indicator is removed from the proper name. 65 phrase recognition method described above can be used in a 

With the knowledge of the company name, it may be variety of text searching systems. These include, but need 

useful to check the existence of the same company name in not be limited to, dynamic phrase recognition in on-line 
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systems, batch phrase recognition in a distributed develop- played current document could not successfully be pro- 

ment system, and batch phrase recognition in a mainframe cessed under the ".more" command, 

system. The following description of the applications of the Block 708 indicates that the newly-added words or 

inventive phrase recognition method is illustrative, and phrases are added to the search query which previously 

should not limit the scope of the invention as defined by the 5 resulted in the user's viewing the current document. Block 

claims. 709 indicates the system's displaying the new "combined" 

In an on-line system (OLS) envisioned as a target appli- search query to the user. The user may edit the new query, 

cation for the inventive phrase recognition method, a user or may simply accept the new query by pressing "enter", 

viewing a current document and entering a command to FIG. 7B schematically indicates implementation of the 

search for documents of similar conceptual content must 10 phrase recognition method in a batch phrase recognition 

wait for the phrase recognition process to be completed. system in a distributed development system. 

Accordingly, the efficiency of the inventive phrase recogni- In contrast to the implementation of the on-line system of 

tion method is important, as it allows reduced response time FIG. 7A, in the application shown in FIG. 7B, the phrase 

and uses minimal resources in a time-sharing environment. recognition method is applied to a large collection of 

According to the application of the invention in a given 15 documents, and produces a list of phrases associated with 

on-line system, the method processes the text in a single the entire collection. As mentioned above with reference to 

document in real time to arrive at a list of "good" phrases, FIG. 7A, the phrase dictionary may be generated by this 

namely, ones which can be used as accurate and meaningful batch recognition process in the "distributed development 

indications of the document's conceptual content, and which domain" (DDD) when there is an abundance of idle system 

can be used as similarly accurate and meaningful queries in 20 resources. When the on-line system then uses the resultant 

subsequent search requests. In particular, according to a phrase dictionary, the phrase dictionary is essentially static, 

preferred application, the Phrase List derived from the single having been generated and modified outside the on-line 

document is used to construct a new search description to sessions. 

retrieve additional documents with similar conceptual con- The FIG. 7B application takes longer to execute than the 

tent to the first document. 25 single-document phrase recognition process occurring in the 

This implementation of the phrase recognition method dynamic phrase recognition in the on-line application, 

may, for example, be embedded in session administrator Accordingly, the FIG. 7B process is preferably executed as 

(FIG. 2) or other software which governs operation of the a batch process at times when overall system usage is not 

computer system on which the phrase recognition method. impaired, such as overnight. In particular, the software 

Of course, the particular implementation will vary with the 30 implementation of the phrase recognition/phrase dictionary 

software and hardware environment of the particular appli- building process may be implemented on SUN work sta- 

cation in question. tions. 

FIG. 7A indicates a user's viewing of a document in As a background to FIG. 7B, a developer's control file 

accordance with a suitable on-line text search system, and defines which documents, and/or which portions of the 

invoking the inventive phrase recognition method to search 35 documents, should be processed in a given run. Block 723 

for additional documents of similar conceptual content. In indicates a filtering process which filters out documents and 

particular, block 701 assumes a user viewing a given docu- portions of documents which are not desired to contribute to 

ment enters a command (such as ".more*') to retrieve more the phrase dictionary, based on the control file. Block 724 

documents similar in conceptual content to the current one indicates application of the inventive phrase recognition 

being viewed. 40 method to the documents and portions of documents which 

When the ".more" command is entered, control passes to have passed through filter process 723. 

block 702 which indicates retrieval of the document being The output of the phrase recognition process is a phrase 

viewed and passing it to the session administrator or other list (PL) which, in the illustrated non-limiting embodiment, 

software which includes the inventive phrase recognition is stored as a standard UNIX text file on disk. In a preferred 

software. 45 embodiment, single-word terms which are encountered are 

Block 703 indicates execution of the inventive phrase discarded, so mat only multiple word phrases are included in 

recognition method on the text in the retrieved document. A the phrase list (PL). 

candidate phrase list is generated, based on that single For simplicity, each phrase is provided on a single fine in 

.document — — the file.~Block 725 indicates how the" UNIX'file is^sorted - 

Block 704 indicates how the candidate phrase list gener- 50 using, for example, the standard UNIX sort utility, causing 

ated from the single document may be validated against an duplicate phrases to be grouped together. Block 725 also 

existing (larger) phrase dictionary. The static phrase dictio- calculates the frequency of each of the grouped phrases, 

nary may be generated as described below, with reference to If a given phrase occurs less than a given threshold 

the batch phrase recognition application in a distributed number of times (e.g., five times as tested by decision block 

development system. 55 727) it is discarded, as indicated by decision block 726. Only 

If a candidate phrase does not already exist in the phrase phrases which have been encountered at least that threshold 

dictionary, the candidate phrase is broken down into its number of times survive to be included in the revised Phrase 

component words. Ultimately, a list of surviving phrases is list, as shown in block 728. 

chosen, based on frequency of occurrence. The revised Phrase List is then transferred from the SUN 

At decision block 705, if at least a given threshold number 60 work station to its desired destination for use in, for 

of words or phrases (e.g., five words or phrases) is extracted, example, the on-line system described above. It may also be 

control passes from decision block 705 to block 708, transferred to a main frame computer using a file transfer 

described below. protocol FTP, to be processed by a phrase dictionary build- 

If, however, the given threshold number of words or ing program and compiled into a production phrase dictio- 

phrases are not extracted, control passes from decision block 65 nary. This process is shown illustrated as block 729. 

705 along path 706 back to block 701, after displaying an Referring now to FIG. 7C, the application of the inventive 

error message at block 707 which indicates that the dis- phrase recognition method on a mainframe system is sche- 
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matically illustrated. In the illustrated application, the phrase 
recognition method is implemented as a batch process in a 
production mainframe. The process involves a random 
sample of documents from a larger collection of documents, 
and produces a set of phrases for each document processed. 
The process is preferably executed when system resources 
are not otherwise in high demand, such as overnight. The 
process of FIG. 7C is especially useful for use with statis- 
tical thesauri. 

As a background, it is assumed that phrases may be 
considered to be "related" to each other if they occur in the 
same document. This "relationship" can be exploited for 
such purposes as expanding a user's search query. However, 
in order to provide this ability, a large number of documents
must first be processed. 
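
The co-occurrence relationship described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: the function names `build_related` and `expand_query`, the `min_docs` cutoff, and the sample phrases are all hypothetical, and each document is assumed to have already been reduced to a list of phrases.

```python
from collections import defaultdict
from itertools import combinations


def build_related(doc_phrases):
    """Count, for every pair of phrases, how many documents contain both."""
    related = defaultdict(lambda: defaultdict(int))
    for phrases in doc_phrases:
        # Each phrase counts once per document, so duplicates are dropped first.
        for a, b in combinations(sorted(set(phrases)), 2):
            related[a][b] += 1
            related[b][a] += 1
    return related


def expand_query(query_phrases, related, min_docs=2):
    """Append phrases that share at least min_docs documents with a query phrase."""
    expanded = list(query_phrases)
    for phrase in query_phrases:
        for other, shared in sorted(related.get(phrase, {}).items()):
            if shared >= min_docs and other not in expanded:
                expanded.append(other)
    return expanded


# Hypothetical per-document phrase lists.
docs = [
    ["white house", "press secretary", "budget deficit"],
    ["white house", "press secretary"],
    ["budget deficit", "interest rate"],
]
related = build_related(docs)
expanded = expand_query(["white house"], related)
print(expanded)  # ['white house', 'press secretary']
```

With only three documents the counts are too small to be meaningful, which is consistent with the observation that a large number of documents must be processed before the relationship becomes useful.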

Referring again to FIG. 7C, block 733 indicates the 
filtering of documents and portions thereof in accordance 
with specifications from a control file, in much the same 
manner as described with reference to FIG. 7B. Block 734 
indicates the application of the inventive phrase recognition 
method to the documents which pass the filter. One set of 
terms (single words, phrases or both) is produced for each 
document and stored in a respective suitably formatted data
structure on a disk or other storage medium. 
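
Given one set of terms per document, the collection frequency trimming described with reference to FIG. 6C (block 304) might be sketched as follows. The function names and the in-memory representation (a list of Python sets) are assumptions made for illustration; the specification does not prescribe a data structure.

```python
from collections import Counter


def collection_frequencies(doc_term_sets):
    """Return, for each phrase, the number of documents that contain it."""
    freq = Counter()
    for terms in doc_term_sets:
        freq.update(set(terms))  # one count per document, however often it occurs
    return freq


def trim_by_collection_frequency(doc_term_sets, threshold=5):
    """Keep only phrases whose collection frequency reaches the threshold."""
    freq = collection_frequencies(doc_term_sets)
    return {phrase: n for phrase, n in freq.items() if n >= threshold}


# Hypothetical term sets for four documents.
doc_sets = [
    {"white house", "budget deficit"},
    {"white house"},
    {"white house", "interest rate"},
    {"budget deficit"},
]
kept = trim_by_collection_frequency(doc_sets, threshold=2)
print(kept == {"white house": 3, "budget deficit": 2})  # True
```

The example uses a threshold of two only to keep the sample small; the text gives five documents as an example value.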

Further details of implementation of the applications of 
the inventive phrase recognition method depend on the 
particular hardware system, software platform, program- 
ming languages, and storage media being chosen, and lie 
within the ability of those skilled in the art. 
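
As one hedged illustration of such an implementation, the on-line flow described with reference to FIG. 7A (blocks 704 through 708) might look like the Python sketch below. The representation of a query as a list of terms, the `max_terms` cap, and both function names are assumptions introduced here; only the dictionary validation, the breakdown into component words, and the threshold test come from the description.

```python
def validate_candidates(candidates, phrase_dictionary):
    """Keep candidates found in the dictionary; break the others into words.

    candidates is a list of (phrase, frequency) pairs from a single document.
    """
    terms = {}
    for phrase, freq in candidates:
        if phrase in phrase_dictionary:
            terms[phrase] = terms.get(phrase, 0) + freq
        else:
            for word in phrase.split():
                terms[word] = terms.get(word, 0) + freq
    return terms


def build_combined_query(previous_query, candidates, phrase_dictionary,
                         min_terms=5, max_terms=10):
    """Return the combined query, or None when too few terms were extracted."""
    terms = validate_candidates(candidates, phrase_dictionary)
    if len(terms) < min_terms:
        return None  # caller would display the error message of block 707
    # Survivors are chosen by frequency; ties are broken alphabetically (assumption).
    ranked = sorted(terms, key=lambda t: (-terms[t], t))[:max_terms]
    return previous_query + [t for t in ranked if t not in previous_query]


dictionary = {"white house", "budget deficit"}
candidates = [("white house", 3), ("budget deficit", 2), ("press corps", 1)]
query = build_combined_query(["clinton"], candidates, dictionary, min_terms=3)
print(query)  # ['clinton', 'white house', 'budget deficit', 'corps', 'press']
```

With the default `min_terms=5` the same input yields `None`, which corresponds to the path back to block 701.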

The following lists are exemplary, illustrative, non-limiting
examples of a Partition List and other lists which may be used with an
embodiment of the phrase recognition method according to the present
invention.

List A

Example of PARTITION LIST
(On-Line System with News Data)
Copyright 1995
LEXIS-NEXIS, a Division of Reed Elsevier Inc.



A
AM
ABOUT
ABOVE
ACROSS
AFFECT
AFTER
AGAIN
AGO
ALL
ALREADY
ALSO
ALTHOUGH
ALWAY
AN
AND
ANOTHER
ANY
ANYBODY
ANYMORE
ANYONE
ANYTHING
APR
APRIL
ARE
AROUND
AS
ASIDE
ASK
AT
AUG
AUGUST
AWAY
BAITH
BE
BECAME
BECAUSE
BEEN
BEFORE
BEING
BELOW
BETWEEN
BOTH
BUT
BY
CAN
COULD
DEC
DECEMBER
DID
DO
DOE
DUE
DURING
EG
EACH
EIGHT
EIGHTEEN
EITHER
ELEVEN
EVEN
EVENTUALLY
EVER
EVERYBODY
EVERYMAN
EVERYONE
EVERYTHING
EXCEPT
FEB
FEBRUARY
FEW
FEWER
FIFTEEN
FIVE
FOR
FOUR
FOURTEEN
FRI
FRIDAY
FROM
GET
GO
GOT
HAD
HAPPEN
HARDLY
HAS
HAVE
HAVING
HE
HENCE
HER
HER'N
HERE
HEREAFTER
HEREBY
HEREIN
HEREINAFTER
HEREIN-
HEREOF
HEREON
HERETO
HEREWITH
HERN
HERSELF
HIC
HIM
HIMSELF
HIS
HIS'N
HISSELF
HOC
HOW
HOWEVER
I
I'D
I'LL
I'M
I'VE
IE
IF
IN
INTO
IS
IT
ITS
ITSELF
JAN
JANUARY
JUL
JULY
JUN
JUNE
JUST
LIKE
MANY
MAR
MARCH
MAY
ME
MIGHT
MINE
MON
MONDAY
MORE
MUCH
MUST
MY
MYSELF
N.S
NANE
NEITHER
NEVERTHELESS
NINE
NINETEEN
NO
NOBLEWOMAN
NOBODY
NONE
NOR
NOT
NOV
NOVEMBER
NOW
O.S
OCT
OCTOBER
OF
OFF
OFTEN
ON
ONE
ONESELF
ONLY
ONTO
OR
OTHER
OTHERWISE
OUGHT
OUR
OUR'N
OURSELF
OURSELVE
OUT
OVER
PM
PERHAP
QUIBUS
QUITE
RATHER
REALLY
REV
SAID
SAME
SAT
SATURDAY
SAY
SEE
SEEMED
SELF
SEP
SEPT
SEPTEMBER
SEVEN
SEVENTEEN
SEVERAL
SHE
SHOULD
SINCE
SIR
SIX
SIXTEEN
SO
SOME
SOFAR
SOMEBODY
SOMEONE
SOMETHING
SOMETIME
SOMEWHERE
SOONER
STILL
SUCCUSSION
SUCH
SUN
SUNDAY
TAKE
TEN
THAE
THAN
THAT
THE
THEE
THEIR
THEIRSELF
THEIRSELVE
THEM
THEMSELVE
THEN
THERE
THEREAFTER
THEREBY
THEREFORE
THEREFROM
THEREIN
THEREOF
THEREON
THERETO
THEREWITH
THESE
THEY
THIRTEEN
THIS
THOSE
THOU
THOUGH
THREE
THROUGH
THUR
THURSDAY
THUS
THY
THYSELF
TILL
TO
TODAY
TOMORROW
TOO
TUE
TUESDAY
TWELVE
TWENTY
TWO
UN
UNDER
UNLESS
UNTIL
UNTO
UP
UPON
US
USE
VERY
VIZ
WAS
WE
WED
WEDNESDAY
WERE
WHAT
WHATE'ER
WHATEVER
WHATSOE'ER
WHATSOEVER
WHEN
WHENEVER
WHENSOEVER
WHERE
WHEREBY
WHEREEVER
WHEREIN
WHETHER
WHICH
WHICHEVER
WHICHSOEVER
WHILE
WHO
WHOEVER
WHOM
WHOMSOEVER
WHOSE
WHOSESOEVER
WHOSO
WHOSOEVER
WHY
WILL
WITH
WITHIN
WITHOUT
WOULD
YA
YE
YES
YESTERDAY
YET
YON
YOU
YOUR
YOUR'N
YOURSELF
1ST
10TH
11TH
12TH
13TH
14TH
15TH
16TH
17TH
18TH
19TH
2D
2ND
20TH
21ST
3D
3RD
4TH
5TH
6TH
7TH
8TH
9TH
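
A Partition List such as List A is described as being used for generating text chunks. A minimal Python sketch of that use follows, assuming that a chunk is a maximal run of words containing no Partition List entry and that punctuation also ends a chunk (the punctuation rule, the tokenizing pattern, and the name `text_chunks` are assumptions made here). Only a small subset of List A is included.

```python
import re

# Small subset of List A, chosen for the example sentence.
PARTITION = {"A", "AN", "AND", "AT", "BY", "FOR", "IN", "IS",
             "OF", "ON", "THE", "TO", "WAS", "WITH"}


def text_chunks(text, partition=PARTITION):
    """Split text into chunks at Partition List words and at punctuation."""
    chunks, current = [], []
    for token in re.findall(r"[A-Za-z0-9'-]+|[^\sA-Za-z0-9]", text):
        is_word = token[0].isalnum()
        if is_word and token.upper() not in partition:
            current.append(token)
        else:
            # A partition word or punctuation mark closes the current chunk.
            if current:
                chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks


sentence = "The right to counsel is guaranteed by the Sixth Amendment."
print(text_chunks(sentence))
# ['right', 'counsel', 'guaranteed', 'Sixth Amendment']

# Taking a preposition out of the list, as described for FIG. 6B, lets the
# phrase that contains it survive as one chunk.
print(text_chunks(sentence, PARTITION - {"TO"}))
# ['right to counsel', 'guaranteed', 'Sixth Amendment']
```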



List B 

Example of COMPANY INDICATOR LIST 
Copyright 1995 
LEXIS-NEXIS, a Division of Reed Elsevier Inc.

BROS 

BROS.

BROTHERS 

CHARTERED 

CHTD 

CHTD. 

CL 

CL. 

CO 

CO. 

COMPANY 

CORP 

CORP. 

CORPORATION 

CP 

CP. 

ENTERPRISES 

GP 

GP. 

GROUP 

INC 

INC.

INCORP 

INCORP. 

INCORPORATED 

INE 

INE. 

LIMITED 
LNC 
LNC.
LTD 

LTD. 
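
The use of a company indicator list such as List B can be sketched as below. The sketch assumes that indicators are compared after upper-casing and stripping a trailing period or comma, that an indicator cannot be the first word of a name, and that "updating" the frequency count means adding the count of the same name without its indicator; these points, like the function names, are assumptions rather than details given in the description.

```python
# Subset of List B, stored without trailing periods.
COMPANY_INDICATORS = {"BROS", "BROTHERS", "CO", "COMPANY", "CORP",
                      "CORPORATION", "GROUP", "INC", "INCORPORATED",
                      "LIMITED", "LTD"}


def company_name(proper_name, indicators=COMPANY_INDICATORS):
    """Return the name cut off after its company indicator, or None."""
    words = proper_name.split()
    for i, word in enumerate(words):
        if i > 0 and word.upper().rstrip(".,") in indicators:
            return " ".join(words[: i + 1])  # drop any word after the indicator
    return None


def categorize_companies(proper_names):
    """proper_names maps each recognized proper name to its frequency."""
    companies = {}
    for name, freq in proper_names.items():
        company = company_name(name)
        if company is not None:
            companies[company] = companies.get(company, 0) + freq
    for company in list(companies):
        # Fold in occurrences of the same name written without the indicator.
        bare = company.rsplit(" ", 1)[0]
        companies[company] += proper_names.get(bare, 0)
    return companies


names = {"Reed Elsevier Inc. Today": 1, "Reed Elsevier Inc.": 2,
         "Reed Elsevier": 3, "White House": 4}
print(categorize_companies(names))  # {'Reed Elsevier Inc.': 6}
```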



List C 

Example of PRODUCT NAME LIST 
Copyright 1995 
LEXIS-NEXIS, a Division of Reed Elsevier Inc. 






240sx
30Qsx
4-Runner
7Up
Access
Adobe
Altima
Arid
Avia
B-17
B17
BMW
Bayer
Blazer
Bounty
Camary
Cannon
Chevy
Cirrus
Coke
Converse
Corvette
Etonic
Excel
F-14
F-15
F-16
F-18
F-22
F14
F15
F16
F18
F22
Infinity
Ingres
JVC
Jaguar
Jeep
Keds
Kleenex
L.O.C
Lexus
Linux
Lotus
Magnavox
Maxima
Mercedes
Minolta
Mitsubishi
Mustang
Nike
Nikon
OS2
Oracle
P100
P120
P133
P60
P75
P90
Paradox
Pepsi
Preparation-H
Puffs
Puma
Quicken
Rave
Reebok
Rolaids
SA8
Sable
Sentra
Seven-Up
Solaris
Sony
Sprite
Suave
Sun
Sybase
Taurus
Tide
Toshiba
Tums
Tylenol
Windex
Windows
Yashika
Zoom


List D 

Example of ORGANIZATION INDICATOR LIST 

Copyright 1995 
LEXIS-NEXIS, a Division of Reed Elsevier Inc. 



ADMINISTRATION
AGENCY
ARMY
ASSEMBLY
ASSOCIATION
BOARD
BUREAU
CENTER
CHURCH
CLINIC
CLUB
COLLEGE
COMMISSION
COMMITTEE
COMMUNITY
CONGRESS
COUNCIL
COURT
CULT
DEPARTMENT
DEPT
FACTION
FEDERATION
FOUNDATION
GOVERNMENT
GUILD
HOSPITAL
HOUSE
INDUSTRY
INSTITUTE
LEAGUE
MEN
ORGANIZATION
PARLIAMENT
PARLIMENT
PARTY
REPUBLIC
SCHOOL
SENATE
SOCIETY
TEAM
UNION
UNIVERSITY
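
The organization rule (indicator as first or last word) and the first-name rule (words before the first name are removed) can be combined in one small categorizer. The Python sketch below is illustrative only: the order in which the two rules are tried, the requirement that a first name be followed by at least one more word, and the name `categorize` are assumptions. The indicator words are a subset of List D and the first names are a subset of List E, which follows.

```python
ORG_INDICATORS = {"AGENCY", "ASSOCIATION", "BUREAU", "COMMITTEE",
                  "DEPARTMENT", "HOUSE", "SENATE", "UNIVERSITY"}
FIRST_NAMES = {"ABIGAIL", "ALBERT", "BILL", "CAROL", "DAVID", "EDWARD"}


def categorize(proper_name):
    """Return (category, cleaned_name); category is None when nothing matches."""
    words = proper_name.split()
    upper = [w.upper() for w in words]
    # Organization: the first or the last word is an indicator.
    if upper and (upper[0] in ORG_INDICATORS or upper[-1] in ORG_INDICATORS):
        return ("organization", proper_name)
    # Person: a known first name followed by at least one more word;
    # anything before the first name is removed.
    for i, word in enumerate(upper[:-1]):
        if word in FIRST_NAMES:
            return ("person", " ".join(words[i:]))
    return (None, proper_name)


print(categorize("White House"))             # ('organization', 'White House')
print(categorize("President Bill Clinton"))  # ('person', 'Bill Clinton')
print(categorize("Sixth Amendment"))         # (None, 'Sixth Amendment')
```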





List E

Example of ENGLISH FIRST-NAME LIST 
Copyright 1995 
LEXIS-NEXIS, a Division of Reed Elsevier Inc. 

AARON ADOLF ALBERT ALLECIA 

ABAGAIL ADOLPH ALBERTA ALLEEN

ABBIE ADOLPHUS ALBIN ALLEGRA


ABBY 


ADORA 


ALDO 


ALLEN 




BART 


BERTON 


BORIS 


BRUNO 


ABE 


ADRIAN 


ALDUS 


ALLENE 




BARTHOLOMEW 


BERTRAM 


BOYCE 


BRYAN 


ABEGAIL 


ADRIANE 


ALEC 


ALLICIA 




BARTON 


BERTRAND 


BOYD 


BRYANT 


ABEL 


ADRIANNE 


ALECIA 


ALLIE 




BASIL 


BERTRUM 


BRACIE 


BRYCE 


ABELARD 


ADRIEN 


ALECK 


ALLISON 




BAYARD 


BERYL 


BRACK 


BRYON 


ABIGAIL 


ADRIENNE 


ALENE 


ALLOYSIUS 




BEA 


BESS 


BRAD 


BUBBA 


ABNER 


AERIEL 


ALEX 


ALLY 




BEATRICE 


BESSIE 


BRADDOCK


BUCK 


ABRAHAM 


AGATHA 


ALEXANDER 


ALLYN 




BEATRIX 


BETH 


BRADLEY 


BUCKY 


ABRAM 


AGGIE 


ALEXANDRA 


ALMA 




BEAUREGARD 


BETSEY 


BRADLY 


BUD 


ACE 


AGGY 


ALEXANDRIA 


ALMETA 




BEBE 


BETSIE 


BRAINARD 


BUDDIE 


ACT 


AGNES 


ALEXEI 


ALMIRA 




BECCA 


BETSY 


BRAINERD 


BUDDY 


ADA 


AGNETA 


ALEXI 


ALMON 


BECKY 


BETTE 


BRANDI 


BUEL 


ADAH 


AGUSTIN 


ALEXIA 


ALONZA 




BEE 


BETTIE 


BRANDY 


BUELL 


ADAIR 


AHARON 


ALEXIS 


ALOYSIUS 




BELINDA 


BETTY 


BRANKA 


BUFFIE 


ADALBERT 


AIDA 


ALF 


ALPHA 




BELLA 


BETTYE 


BREK 


BUFFY 


ADALINE


AILEEN 


ALFIE 


ALPHONSUS 




BELLE 


BEULAH 


BRENARD 


BUFORD 


ADAM 


AILEENE 


ALFIO 


ALTA 




BEN 


BEVERLEE 


BRENDA


BUNNIE 


ADDAM 


AILENE 


ALFORD 


ALTHEA 


BENEDICT 


BEVERLIE 


BRENDAN 


BUNNY 


ADDIE 


AIME 


ALFRED 


ALTON 




BENJAMIN 


BEVERLY 


BRENT 


BURL 


ADDY 


AIMEE 


ALFREDA 


ALVA 




BENJI 


BEWANDA 


BRET 


BURNELL 


ADELA 


AINSLEE 


ALFY 


ALVAH 




BENNETT 


BIFF 


BRETT 


BURNETTA 


ADELAIDE 


AINSLEY 


ALGERNON 


ALVESTER 




BENNIE 


BILL 


BRIAN 


BURNICE 


ADELBERT 


AJA 


ALICE 


ALVIN




BENNO 


BILLIE


BRICE 


BURREL


ADELE 


AL 


ALICIA 


ALYCE 




BENNY 


BILLY 


BRIDGET 


BURT 


ADELENE 


ALAIN 


ALINE 


AMALIA 




BENTLEY 


BIRD 


BRIDGETT 


BURTON 


ADELINE 


ALAINE 


ALISA 


AMANDA 




BERKE 


BJARNE 


BRIDGETTE 


BURTRAM 


ADELLA 


ALAN 


ALISHA 


AMARYLLIS 




BERKELEY 


BJORN 


BRIDIE 


BUSTER 


ADELLE 


ALANAH


ALISON 


AMBER 




BERKELY 


BJORNE 


BRIGIT 


BUTCH 


ADLAI 


ALANNA 


ALIX


AMBROSE 




BERKLEY 


BLAINE 


BRIGITTE 


BYRON 


ADNA 


ALASTAIR 


ALLAN 


AMBROSIA 




BERLE 


BLAIR 


BRUITTE 


CAESAR 


AMBROSIUS 


ANNE MARIE 


ARLO 


AUDREY 




BERNARD 


BLAKE 


BRITNY 


CAITLIN 


AMELIA 


ANNETTA 


ARMAND 


AUDRIE 




BERNETTA 


BLANCA 


BRITTANY 


CAL 


AMIE 


ANNETTE 


ARMIN 


AUDRY 




BERNETTE 


BLANCH 


BRITTNEY 


CALE 


AMILE 


ANNICE 


ARMOND 


AUDY 




BERNHARD 


BLANCHE 


BRITTNY


CALEB 


AMITY 


ANNIE 


ARNE 


AUGIE 




BERNICE 


BOB 


BROCK 


CALLA 


AMON 


ANNINA 


ARNETT 


AUGUST 




BERNIE 


BOBBI 


BRODERICK 


CALLIE


AMOS 


ANNMARIE 


ARNEY 


AUGUSTINE 


BERRIE 


BOBBIE 


BROOKE 


CALLY 


AMY 


ANSEL 


ARNIE 


AUGUSTUS 




CALVIN 


CARROLL 


CHARLEEN 


CHRYSTAL 


ANA 


ANSELM 


ARNOLD 


AURELIA




CAM 


CARSON 


CHARLENE 


CHUCK 


ANABEL


ANSON 


ARON 


AURELIUS




CAMDEN 


CARY 


CHARLES 


CHUMLEY 


ANABELLE 


ANTHONY 


ART 


AUSTEN 




CAMERON 


CARYL 


CHARLESE 


CICELY 


ANALISE


ANTOINE 


ARTE 


AUSTIN 




CAMILE 


CARYN 


CHARLETON 


CICILY 


ANASTASIA


ANTOINETTE 


ARTEMIS 


AUTHER 


CAMILLA 


CAS 


CHARLEY 


CINDI 


ANATOLY 


ANTON 


ARTEMUS 


AUTRY 




CAMILLE 


CASEY 


CHARLIE 


CINDY 


ANCBL 


ANTONE 


ARTHUR 


AUVEL 




CANDACE 


CASI 


CHARLINE 


CLAIR 


ANDIE 


ANTONETTE 


ARTIE 


AVA 




CANDI 


CASPAR 


CHARLISE 


CLAIRE 


ANDREA 


ANTONI 


ARTIS 


AVERY 




CANDICE 


CASPER 


CHARLOTTA 


CLARA 


ANDREAS 


ANTONIA 


ARTY 


AVIS 




CANDIS 


CASS 


CHARLOTTE 


CLARABELLE


ANDREE


ANTONIO


ARVELL


AVITUS


ANDREI 


ANTONY 


ARVIE 


AVON 




CANDUS 


CASSANDRA 


CHARLTON 


CLARE 


ANDREJ 


AP 


ARVO 


AVRAM 




CANDY 


CASSIE


CHAS 


CLARENCE 


ANDREW 


APOLLO 


ARVON 


AXEL 




CANNIE 


CASSIUS 


CHASTITY 


CLARICE 


ANDY 


APRIL 


ASA 


AZZIE 




CARA 


CATHARINE 


CHAUNCEY 


CLARINA 


ANETTA 


ARA


ASHELY


AZZY 




CAREN 


CATHERINE 


CHELSIE 


CLARISSA 


ANETTE 


ARAM 


ASHER 


BABETTE 




CAREY 


CATHLEEN 


CHER 


CLARK 


ANGELA 


ARBY 


ASHLEIGH 


BAILEY 




CARI 


CATHLENE 


CHERI 


CLASSIE 


ANGELICA 


ARCH 


ASHLEY 


BAIRD 




CARIN 


CATHRINE 


CHERIE 


CLAUD 


ANGELINA 


ARCHIBALD 


ASTER 


BALTHAZAR 




CARL 


CATHRYN 


CHERYL 


CLAUDE 


ANGELINE


ARCHIE 


ASTOR 


BAMBI 




CARLA 


CATHY 


CHESTER 


CLAUDELLE 


ANGELIQUE 


ARETHA 


ASTRID 


BARB 




CARLEEN 


CEASAR 


CHET 


CLAUDETTE 


ANGIE 


ARIC 


ATHENA 


BARBARA 




CARLENE 


CEATRICE 


CHIP 


CLAUDIA 


ANGUS 


ARICA 


ATHENE 


BARBEE 


CARLETON 


CECELIA 


CHLOE 


CLAUDINE 


ANITA 


ARIEL 


ATTILIO 


BARBI




CARLINE


CECIL 


CHLORIS 


CLAUDIUS 


ANN 


ARISTOTLE 


AUBREY 


BARBIE 




CARLISLE 


CECILE 


CHRIS 


CLAUS 


ANNA 


ARLAN 


AUBRIE 


BARBRA 




CARLTON 


CECILIA 


CHRISSIE 


CLAY 


ANNABEL 


ARLEEN 


AUBRY 


BARNABAS 




CARLY 


CECILY 


CHRISSY 


CLAYMON 


ANNABELLE


ARLEN 


AUD 


BARNABUS




CARLYLE 


CEDRIC 


CHRISTA 


CLAYTON 


ANNALEE


ARLENE 


AUDEY 


BARNABY 


CARMINE 


CEFERINO 


CHRISTABEL 


CLEIO 


ANNE 


ARLIE 


AUDIE 


BARNARD 




CAROL 


CELESTE 


CHRISTABELLE 


CLEM 


ANNELIESE


ARLIN 


AUDINE 


BARNET 




CAROLA 


CELESTINE 


CHRISTAL 


CLEMENT 


ANNELISE 


ARLINE 


AUDIO 


BARNETT 




CAROLANN 


CELIA 


CHRISTIAAN 


CLEMENTINE 


BARNEY 


BERRY 


BOBBY 


BROOKS 




CAROLE 


CELINA 


CHRISTIAN 


CLEMENZA 


BARNY 


BERT 


BONNIE


BRUCE 




CAROLE 


CESAR 


CHRISTIE 


CLENELL 


BARRETT 


BERTHA 


BONNY 


BRUNHILDA 




CAROLINE 


CHAD 


CHRISTINE 


CLEO 


BARRY 


BERTHOLD 


BOOKER 


BRUNHILDE




CAROLYN 


CHADWICK 


CHRISTOFER 


CLEOPHUS


CAROLYNN


CHAIM 


CHRISTOPH 


CLEOTHA 




DERRALL 


DOLLY 


DOT 


EARNESTINE


CARREN 


CHANCY 


CHRISTOPHER 


CLEOTIS 




DERREK 


DOLORES 


DOTTIE


EARTHA 


CARRIE 


CHANDLER 


CHRISTOS 


CLETA 




DERRICK 


DOM 


DOTTY 


EARTHEL 


CARRIN 


CHARITY 


CHRISTY 


CLETUS 




DERRY 


DOMENIC 


DOTY 


EBEN 


CLEVE 


CORKY 


DAGMAR 


DARRIN 




DERWOOD 


DOMENICK 


DOUG 


EBENEEZER 


CLEVELAND 


CORNEAL 


DAGWOOD 


DARRYL 




DESDEMONA 


DOMER 


DOUGIE 


EBENEZER 


CLEVON 


CORNELIA 


DAISEY 


DARWIN 




DESI 


DOMINIC 


DOUGLAS 


EBERHARD 


CLIFF


CORNELIUS 


DAISY 


DARYL 




DESIRE 


DOMINICK 


DOUGLASS 


EBONY 


CLIFFORD 


CORRIE 


DALE 


DASHA 




DESIREE 


DOMINICKA 


DOY 


ED 


CLIFT 


CORRINE 


DALTON 


DAVE 




DESMOND 


DOMINIQUE 


DOYLE 


EDD 


CLIFTON 


CORRINNE 


DAMIAN


DAVEY 




DESMUND 


DON 


DREW 


EDDIE 


CLINT 


CORRY 


DAMIEN 


DAVID 


DEVON 


DONALD 


DRU 


EDDY 


CLINTON 


CORTNEY 


DAMION 


DAVIDA




EDGAR 


ELEONORA 


ELSIE 


EPHRIAM 


CLIO 


CORY 


DAMON 


DAVIE 




EDGER 


ELI 


ELTON 


ERASMUS 


CLITUS


COSMO 


DAN 


DAVY 




EDE 


ELIAS


ELVA 


ERBIN 


CLIVE 


COUNTEE 


DAN'L 


DAWN 




EDISON 


ELIC 


ELVERT 


ERIC 


CLOVIA 


COURTLAND 


DANA 


DEAN 




EDITA 


ELIJAH 


ELVIE 


ERICA 


CLOVIS 


COURTNEY 


DANIEL 


DEANDRA 


EDITH 


ELINORE 


ELVIN


ERICH 


CLOYD 


COY 


DANIELLA 


DEANE 




EDMOND 


ELIOT 


ELVIRA 


ERICK 


CLYDE 


CRAIG 


DANIELLE 


DEANNA 




EDMUND 


ELISABETH 


ELVIS 


ERIK 


CODELL 


CRIS 


DANNA 


DEANNE 




EDNA 


ELISE


ELVON 


EREKA 


COLBERT 


CRISPIN 


DANNY 


DEB 




EDRIS 


ELISHA


ELWOOD 


ERIN 


COLE 


CRISPUS 


DANO 


DEBBI 




EDSEL 


ELISSA


ELY 


ERLAND 


COLEEN


CRISSIE 


DANUTA 


DEBBIE 




EDUARD 


ELIUS 


ELZA 


ERLE 


COLETTE 


CRISSY 


DAPHNE 


DEBBY 




EDWARD 


ELIZA 


EMELDA 


ERMA 


COLIN 


CRISTABEL 


DARBIE


DEBORA 




EDWIN 


ELIZABETH 


EMERIC


ERMAN 


COLITA


CRISTABELLE


DARBY


DEBORAH



EDWINA


ELIZAR


EMERY


ERNEST








EDY 


ELKE 


EMETT 


ERNESTINE 


COLLEEN 


CRYSTAL 


DARCEE 


DEBRA 




EDYTH 


ELLA 


EMIL 


ERNESTO 


COLLETTE 


CURLESS 


DARCEY 


DEBS 




EDYTHE 


ELLE 


EMILE 


ERNIE 


COLLIN 


CURLY 


DARCI 


DEDIE 




EFFIE 


ELLEN 


EMILLE 


ERNST 


COLON 


CURT 


DARCIE 


DEE 




EFRAIM 


ELLERY 


EMILY 


ERROL 


CONNIE 


CURTIS 


DARCY 


DEEANN 




EGBERT 


ELLIE


EMMA 


ERVAN 


CONNY 


CY 


DARIEN 


DEEANNE 




EGIDIO 


ELLIET 


EMMALINE 


ERVEN 


CONRAD 


CYBIL 


DARIO 


DEEDEE 




EILEEN 


ELLIOT 


EMMERY 


ERVIN 


CONROY 


CYBILL 


DARIUS 


DEIDRE 




ELA 


ELLIOTT 


EMMET 


ERWIN 


CONSTANCE 


CYNDI 


DARLA 


DEL 


ELAINE 


ELLIS 


EMMETT 


ESAU 


CONSTANTIA


CYNDY 


DARLEEN 


DELAINE 




ELAYNE 


ELLY 


EMMIE 


ESMERELDA 


COOKIE 


CYNTHIA 


DARLENE 


DELANE 




ELBA 


ELLYN 


EMMOT 


ESTA 


CORA 


CYRIL 


DARLINE 


DELANO 




ELBERT 


ELMA 


EMMOTT 


ESTEL 


CORABELLE 


CYRILL 


DARLYNE 


DELBERT 




ELBERTA 


ELMER 


EMMY 


ESTELA 


CORDELIA 


CYRILLA 


DARNELL 


DELIA 




ELDA 


ELMERA 


EMOGENE 


ESTELLA 


COREY 


CYRUS 


DAROLD 


DELL 


ELDINE 


ELMO 


EMORY 


ESTELLE


CORINE 


DABNEY 


DARREL 


DELLA 




ELDON 


ELNOR 


ENDORA 


ESTHR 


CORINNE 


DACIA 


DARRELL 


DELLO 




ELDORA 


ELNORA 


ENDRE 


ESTHA 


CORKIE 


DACIE 


DARREN 


DELMA 




ELE 


ELOI 


ENGELBERT


ESTHER 


DELMAR 


DEVORAH 


DONELL 


DUAIN 




ELEANOR 


ELOISE 


ENID 


ETHAN 


DELMAS 


DEWANE 


DONELLE 


DUAINE




ELEANORA 


ELOUISE 


ENISE 


ETHEL 


DELMO 


DEWAYNE 


DONICE 


DUANE 




ELEANORE 


ELOY 


ENNIS 


ETHELENE 


DELNO 


DEWEY 


DONLS 


DUB 




ELECTRA 


ELRIC 


ENOCH 


ETHELINE 


DELORES 


DEWITT


DONNA 


DUDLEY 




ELENA 


ELROY 


ENOLA 


ETHYL 


DELORIS 


DEXTER 


DONNELL 


DUEL 




ELENORA 


ELSA 


ENZO 


ETIENNE 


DELOY 


DEZ 


DONNELLE


DUELL 




ELENORE 


ELSBETH 


EPHRAIM 


ETTA 


DELTA 


DIAHANN 


DONNIE


DUFF 




ETTIE 


FARRELL 


FRANCIA 


GALE 


DEMETRICE 


DIANA 


DONNY 


DUFFY 




EUDORA 


FARRIS 


FRANCINE 


GALLON 


DEMETRIUS 


DIANE 


DONOVAN 


DUGALD 




EUFA 


FATIMA


FRANCIS 


GARETH 


DENARD 


DIANNA


DORA 


DUKE 




EUGENE 


FAUN 


FRANCOIS 


GARLAND 


DENE 


DIANNE


DORCAS 


DULCIE 




EUGENIA 


FAWN 


FRANCOISE 


GARNET 


DENICE 


DICK 


DORCE 


DULSA 




EUGENIE 


FAY 


FRANK 


GAROLD 


DENILLE 


DICKEY 


DOREEN 


DUNCAN 




EUGENIO 


FAYE 


FRANKIE 


GARRET 


DENIS 


DICKIE 


DORI 


DURWARD




EULA 


FELECIA 


FRANKLIN 


GARRETT 


DENISE 


DIDI 


DORIAN 


DURWOOD 


EULALEE 


FELICIA 


FRANKLYN 


GARRIE 


DENNIE 


DIEDRE 


DORIE 


DUSTIN




EULALIE 


FELICITY 


FRANKY 


GARRY 


DENNIS 


DIERDRE 


DORIENNE 


DUSTY 




EULOGIO 


FELIX 


FRANNIE 


GARTH 


DENNY 


DIETER 


DORINE 


DWAIN 




EUNACE 


FELLZ 


FRANNY 


GARVIN 


DENNYS 


DIETRICH 


DORIS 


DWAINE 




EUNICE 


FERD 


FRANZ 


GARY 


DENORRIS 


DIMITRI 


DOROTHA 


DWAYNE 




EUPHEMIA 


FERDINAND 


FRANZI 


GASTON 


DEO 


DINA 


DOROTHEA 


DWIGHT 


EUSTACE 


FERGUS 


FRANZIE 


GAVIN 


DEON 


DINAH 


DOROTHY 


DYLAN 




EVA 


FERN 


FRANZY 


GAY 


DEREK 


DINO 


DORRANCE 


DYNAH 




EVALEE 


FERREL 


FRED 


GAYE 


DEREWOOD 


DION 


DORRIS 


EARL 




EVAN 


FERRELL 


FREDA 


GAYLE 


DERICK 


DIRK 


DORSEY 


EARLE 




EVANDER 


FERRIS 


FREDDIE 


GAYLORD 


DERL 


DIXIE


DORTHIE 


EARLENE 




EVANGELINE 


FIDELE 


FREDDY 


GEARY 


DERMOT 


DMITRI 


DORTHY 


EARLINE 




EVE 


Fin 


FREDERICH 


GEMMA 


DERMOTT 


DOLLIE


DOSHIE 


EARNEST 




EVELYN 


FILBERT 


FREDERICK 


GENA


EVERETT


FILIPPO


FREDERIK 


GENE 




HEATHER 


HERSHEL 


HOWARD 


IONA 


EVERETTE 


FIONA 


FREEMAN 


GENEVA 




HECTOR 


HESTER 


HOWIE 


IONE 


EVETTE 


FITZ


FREIDA 


GENEVIEVE 




HEDDA 


HESTHER 


HOYT 


IRA 


EVITA


FLETCHER 


FRIEDA 


GENIE 




HEDDEE 


HETTA 


HUBERT 


IRAD 


EWALD 


FLO 


FRIEDRICH 


GENNARO 




HEDWIG 


HETTIE 


HUEL 


IRENA 


EWIN 


FLOR 


FRITZ 


GENNIE




HEIDI 


HETTLE 


HUEY 


IRENE 


EZEKIEL 


FLORA 


FRONA 


GENNIFER 




HEINRICH 


HETTY 


HUGH 


IRIS 


EZRA


FLORANCE 


FYODOR 


GENNY 




HEINZ 


HEYWOOD 


HUGO 


IRL 


FABIAN 


FLORENCE 


GABBEY 


GENO 




HELAINE


HEZEKIAH 


HULDA 


IRMA 


FAB EN 


FLORIDA 


GABBIE


GEO 




HELEN 


HILARY 


HULDAH 


IRVIN 


FAIRLEIGH 


FLOSSIE 


GABE 


GEOFF 




HELENA 


HILDA 


HUMPHREY 


IRVING 


FAITH 


FLOSSY 


GABRIEL 


GEOFFREY 


HELENE 


HILDE 


HUNTINGTON 


IRWIN 


FANNIE 


FLOYD 


GABRIELE


GEORGE 




HELGA 


HILDEGARD 


HURSCHEL 


ISAAC 


FANNY 


FONDA 


GABRIELLE 


GEORGES 




HELMUT 


HILDE- 


HY 


ISAAK 


FARAH 


FONTAINE 


GABY 


GEORGETTE 






GARDE 






FARLEIGH


FORD 


GAEL 


GEORGIA 




HELMUTH 


HILDRED 


HYACINTH 


ISABEL 


FARLEY 


FORREST 


GAETANO 


GEORGE 




HELOISE 


HILDUR 


HYMAN 


ISABELLA 


FARRAH


FRAN 


GAGE 


GEORGINA


HENDREK 


HILLARD 


IAIN


ISABELLE 


FARREL 


FRANCES 


GAEL 


GERALD 




HENDREKA 


HILLARY 


IAN 


ISAC 


GERALDINE


GIULIA


GREGGORY 


GWENETTA 




HENNEE 


HILLERY 


ICHABOD 


ISADOR 


GERARD 


GIUSEPPE 


GREGORY 


GWENETTE 




HENNY 


HIRAM 


EDA 


ISADORA 


GERD 


GIUSEPPI 


GRETA 


GWENITH




HENRI 


HIRSCH 


IGGY 


ISADORE 


GERDINE 


GIUSEPPINA 


GRETCHEN 


GWENN 




HENRI ETA 


HOBART 


IGNATIUS 


ISMAH 


GERHARD 


GLADIS 


GRETEL 


GWENNE 




HENRIETTA 


HOBERT 


IGNATZ 


ISHAM 


GERI 


GLADYCE 


GRIFF 


GWENNETH 




HENRIETTE


HOLLEY 


IGNAZ 


ISHMAEL 


GERMAIN 


GLADYS 


GRIFFIN 


GWENYTH 




HENRIK


HOLLI 


IGOR 


ISEAH 


GERMAINE 


GLEN 


GRIFFITH 


GWYEN 




HENRY 


HOLLIE 


EKE 


ISIDOR 


GEROLD 


GLENDA


GROVER 


GWYLLEN 




ISIDORE 


JAN 


JEMIMA 


JELEJAN 


GEROME 


GLENDON 


GUENTHER 


GWYN 




ISMAEL 


JANA 


JEMMA 


JEM 


GERRIE 


GLENN 


GUERINO 


GWYNETH 




ESOM 


JANE 


JEMMIE 


JEMBO 


GERRIT 


GLENNA


GUIDO


GWYNNE 




ISRAEL 


JANE EN 


JEMMY 


JEMME 


GERRY 


GLENNIE


GUILLERMINA 


GYULA 




ISTVAN 


JANEL 


JENNIE 


JEMMIE 


GERT 


GLENNIS 


GUILLERMO


HAILEY 




ETA 


JANELL 


JENNIFER 


JEMMY 


GERTA 


GLENNON 


GUISEPPE 


HAL 




EVA 


JANELLE 


JENNY


JENNY 


GERTIE 


GLORIA 


GUNNER 


HALLIE 




EVAN 


JANET 


JENS 


JO 


GERTRUDE 


GLYN 


GUNTER 


HALLY 




IVANA 


JANICE 


JERALD 


JOAB 


GEZA 


GLYNDA 


GUNTHER 


HAMISH 


IVAR 


JANEE 


JERALDINE 


JOACHIM 


GIA 


GLYMS 


GUS 


HAMPTON 




EVARS 


JANINE


JERE 


JOAN 


GIACOMO 


GLYNN 


GUSSIE 


HANK 




EVETTE 


JANIS 


JEREMIAH 


JOANN 


GIDEON 


GLYNMS 


GUSSY 


HANNA 




IVEY 


JANISE 


JEREMIAS 


JOANNA 


GIFFORD 


GODFREY 


GUST 


HANNAH 




TVEE 


JANNA 


JEREMY 


JOANNE 


GIGI 


GODFRY 


GUSTAF 


HANNE 




IVO


JAQUELYN 


JERI 


JOB 


GIL 


GODWIN 


GUSTAV 


HANNES 


IVONNE


JARED 


JEREANE 


JOCELYN 


GILBERT 


GOLDA 


GUSTAVE 


HANNIBAL 




IVOR 


JARRETT


JERIE 


JOCILYN 


GILDA 


GOLDIE


GUSTAVUS 


HANS 




IVORY


JARRYL 


JERMAIN 


JOCLYN 


GILES 


GOLOY 


GUSTOV 


HANS ELL 




IVY 


JARVIS 


JERMAINE 


JODI 


GILLIAN 


GOMER 


GUY 


HARLAN 




EZAAK 


JAS 


JEROL 


JODEE 


GILLIGAN 


GORDAN 


GWEN 


HARLEN 




IZZIE 


JASMINE 


JEROLD 


JODY 


GINA 


GORDON 


GWENDA 


HARLEY 




EZZY 


JASON 


JEROME 


JOE 


GINGER 


GOTTFRIED 


GWENDALYN 


HARLIE




JAC 


JASPER 


JERRALD 


JOEL 


GINNI 


GRACE 


G WEND EN 


HARLY 




JACK 


JAY 


JERRED 


JOETTE 


GINNIE 


GRACIA 


GWENDLYN 


HARMIE




JACKI 


JAYME 


JERRELD 


JOEY 


GINNY 


GRACEE 


GWENDOLA 


HARMON 




JACKIE 


JAYMEE 


JERRELL 


JOFFRE


GINO 


GRADY


GWENDOLEN


HAROL 





JACKSON 


JAYNE 


JERRI 


JOH 


GIORA 


GRAEME 


GWENDOLINE 


HAROLD 




JACKY 


JEAN 


JERRLE 


JOHANN 


GIOVANNA 


GRAHAM 


GWENDOLY 


HARREL 




JACOB 


JEAN KITE 


JERROLD 


JOHANNES 


GIOVANNI 


GRANT 


GWENDOLYN 


HARRIET 




JACOBUS 


JEANIE 


JERRY 


JOHN 


GISELA


GRAYCE 


GWENDOLYNE 


HARRIETT 




JACQUELEN 


JEANNE 


JERZY 


JOHNTE 


GISELLA 


GREG 


GWENDY 


HARRIETTA 




JACQUELINE 


JEANNETTE 


JESS 


JOHNNA 


GISELLE 


GREGG 


GWENETH 


HARRIS 




JACQUELYN 


JEANNINE


JESSE 


JOHNNIE 


HARRISON 


HENRYK 


HOLLIS 


ILA 




JACQUES 


JEBEDIAH 


JESSICA 


JOHNNY 


HARROLD 


HERE 


HOLLISTER 


ELAH 


JADE 


JED 


JESSIE 


JOJO 


HARRY 


HERBERT 


HOLLY 


ILEEN 




JAKE 


JEDEDIAH 


JETHRO 


JOLENE 


HARVEY 


HERBIE 


HOLLYANN 


ELENA 




JAKOB 


JEFEREY 


JETTA 


JON 


HARVIE 


HERBY 


HOLLYANNE 


ELENE 




JAMES 


JEFF 


JETTIE


JONAH 


HASSLE 


HERCULE 


HOMER 


ELSE 




JAMEY 


JEFFEREY 


JEWEL 


JONAS 


HATTIE 


HERLINDA 


HONEY 


EMELDA 




JAMIE 


JEFFIE 


JEWELL 


JONATHAN 


HATTY 


HERM 


HONORIA 


EMOGENE 


JAMYE 


JEFFREY 


JILL 


JONATHON 


HAYDEE 


HERMA 


HOPE 


ENA 




JONELL 


JULIUS


KELSEY 


KORNELIA 


HAYDEN 


HERMAN 


HORACE 


ENGA 




JOM 


JUNE 


KELVIN 


KORNELEUS 


HAYDON 


HERMANN 


HORST 


INGE 




JONNY 


JUNEY 


KEN 


KRAIG 


HAYLEY 


HERMIE 


HORTENSE 


ENGGA 




JORDAN 


JUNEE 


KENDALL 


KRIS 


HAYWOOD 


HEROLD 


HORTENSIA 


INGRAM 




JORDON 


JUNIOR 


KENDELL 


KRISTA 


HAZEL 


HERSCH 


HORTON 


ENGRED 




JOSEA 


JUNIUS 


KENDRICK 


KRISTEN 


HEATHCLIFF


HERSCHEL 


HOSEA


IOLA 




JOSELYN 


JURGEN 


KENNETH 


KRISTI


JOSEPH


JUSTEN 


KENNY 


KRISTIE




LEITH 


LEVI 


LISABETH


LORIN 


JOSEPHA 


JUSTICE 


KENT 


KRISTIN 




LORNA 


LUIGI 


MADELENE 


MANSEL 


JOSEPHINE 


JUSTIN 


KERI 


KRISTINE 




LORNE 


LUKE 


MADELINE


MARABEL 


JOSEY 


JUSTINE 


KERMIT 


KRISTOFFER




LORRAINE 


LULA 


MADELYN 


MARC 


JOSH 


KALVIN 


KERRIE 


KRISTY 




LORRAYNE 


LULAH 


MADGE 


MARCEL 


JOSHUA 


KARA 


KERRY 


KURT 




LOTHAR 


LULU 


MADIE 


MARCEUN 


JOSIAH 


KAREN 


KERSTIN


KYLE 




LOTTIE 


LUMMIE 


MADONNA 


MARCELL 


JOSIE 


KARIN 


KERVIN


KYM 




LOU 


LUNA 


MAE 


MARCELLA 


JOY 


KARL 


KERWIN 


LACEY 




LOUELLA 


LURLEEN 


MAGDA 


MARCELLE 


JOYCE 


KARLA 


KEVIN 


LACIE 




LOUIE 


LURLENE 


MAGDALENA 


MARCELLUS 


JOYCEANN 


KAROLA 


KJERSTEN 


LACY 




LOUIS 


LURUNE 


MAGDALENE 


MARCI


JOZEF 


KASEY 


KILIAN


LADON 


LOUISA 


LUTHER 


MAGDALINE


MARCIA 


JOZSEF 


KASPAR 


KILLIAN 


LAINE 




LOUISE 


LUZ 


MAGGIE 


MARCIE


JUBAL 


KASPER 


KIM 


LALAH 




LOULA 


LUZERNE 


MAGGY 


MARCUS 


JUD 


KATE 


KIMBER 


LAMAR 




LOURETTA 


LYDIA 


MAGNUS 


MARCY 


JUDAH 


KATERINA 


KIMBERLEE


LAMARTINE 




LOVELL 


LYLE 


MAHALA 


MARDA 


JUDAS 


KATEY 


KIMBERLEIGH


LAMBERT 




LOVETTA 


LYMAN 


MAHALLA 


MARGARET 


JUDD 


KATHERINE 


KIMBERLEY 


LAMONT 


LOVETTE 


LYN 


MALA 


MARGE 


JUDE 


KATHERYN 


KIMBERLY 


LANA 




LOWELL 


LYNDA 


MAIBLE 


MARGEAUX 


JUDI 


KATHLEEN 


KIP 


LANCE 




LOY 


LYNDON 


MALTA 


MARGERY 


JUDIE 


KATHRYN 


KIRBY


LANCELOT 




LOYAL 


LYNN 


MAJOR 


MARGI 


JUDITH 


KATHY 


KIRK 


LANE 




LUANN 


LYNNA 


MAL 


MARGIE 


JUDY 


KATIE 


KIRSTEN 


LANSON 




LUANNA 


LYNNE 


MALCOLM 


MARGO 


JULEE 


KATY 


KIRSTI


LARA 




LUANNE 


LYNNETTE 


MALINDA 


MARGOT 


JULES 


KAY 


KIRSTIE 


LARENE 




LUBY 


LYNWOOD 


MALISSA


MARGRET 


JULIA 


KAYE 


KIRSTIN 


LARONE 




LUCAS 


LYSANDER 


MALKA 


MARGY 


JULIAN 


KAYLEEN 


KIRSTY


LARRIS 




LUCIAN


M 'LINDA 


MALLORIE 


MARIAM 


JULIANE 


KEENAN 


KIT 


LARRY 




LUCIE 


MABEL 


MALLORY 


MARIAN 


JULIANNE 


KEISHA 


KITTIE


LARS 




LUCIEN 


MABELLE 


MALORIE 


MARIANN 


JULIE


KEITH 


KITTY 


LASKA 




LUCILE 


MABLE 


MALORY 


MARIANNE 


JULIEN


KELLEY 


KLAUS 


LASLO 




LUCILLA 


MAC 


MALYNDA 


MARIBEL


JULIENNE 


KELU 


KONRAD 


LASZLO 




LUCILLE 


MACE 


MAME 


MARIBELLE


JULIET 


KELLIE 


KONSTAN- 


LATIC 




LUCINDA


MACIE 


MAMIE 


MARIBETH






TINOS 






LUCIUS 


MACK 


MANDEE 


MARIE 


JULIETTE 


KELLY 


KOOS 


LATIMER 




LUCRETIA 


MADALAINE 


MANDI 


MARIEL 


LAUANNA 


LELAH 


LEVON 


LISE




LUCY 


MADALEINE 


MANDIE 


MARIETTA 


LAUNCIE 


LELAND 


LEW 


LISSA


LUDLOW 


MADALINE


MANDY 


MARIETTE


LAURA 


LELIA 


LEWELL 


LIZ




LUDWIG


MADALYN 


MANFRED 


MARLLEE 


LAUREE 


LEUO 


LEWIS 


LIZA




LUDWIK 


MADDEE 


MANICE 


MARILYN 


LAUREL 


LEMMY 


LEXIE 


LIZABETH




LUDY 


MADDY 


MANLEY 


MARILYNN


LAUREN 


LEMUEL 


LEXY 


LIZZIE 




LUELLA 


MADELAINE 


MANLY 


MARINA 


LAURENCE 


LEN 


UANE 


LIZZY




LUGENE 


MADELEINE 


MANNY 


MARIO 


LAURETA 


LENA 


LIBBIE


LLEWELLYN 


MARION 


MARYBETH 


MEAGAN 


MEYER 


LAURETTA 


LENARD 


LIBBY 


LLOYD 




MARISSA 


MARY- 


MEG 


MIA 


LAURETTE 


LENDEN 


UDA 


LLOYDA 






FRANCES 






LAURI 


LENETTE 


LIEF 


LOELLA 




MARJUS 


MARYLOU 


MEGAN 


MIATTA 


LAURIE 


LENISE 


ULA 


LOGAN 




MARJORIE 


MASON 


MEL 


MICAH 


LAURI EN 


LENNIE 


ULAC 


LOIS 




MARJORY 


MATE 


MELANIE 


MICHAEL 


LAVELL 


LENNY 


LILAH 


LOLA 




MARK 


MATEY 


MELANY 


MICHEL 


LAVERA 


LENORA 


LILE 


LOLETA 




MARKEE 


MATHEW 


MELBA


MICHELE 


LAVERN 


LENORE 


LILIAN 


LOUTA 




MARKIE 


MATHIAS 


MEUCENT 


MICHELLE 


LAVERNA 


LENWOOD 


LLLIEN 


LOLLLE 




MARKOS 


MATHILDA 


MELINDA


MICK 


LAVERNE 


LEO 


LLUTH 


LOLLY 




MARKUS 


MATILDA 


MELISSA


MICKEY 


LAVINA


LEOLA 


-i-.n-.i-.iA 


LON~ 





MARLA 


MATT 


MELLICENT 


MICKLE 


LAVIMA 


LEON 


LILLIAN 


LOM 




MARLENA 


MATTHEW 


MELODI 


MICKY 


LAVONNE 


LEONA 


i.n.i.rp 


LONNA 




MARLENE 


MATTHIAS 


MELODIE 


MIDGE 


LAWRENCE 


LEONARD 


LILLY 


LONNIE




MARLEY 


MATTIE 


MELODY 


MIKAEL 


LDA 


LEONID 


LILY 


LONNY 




MARUN 


MATTY 


MELONIE 


MIKAL 


LEA 


LEONIDA 


UN 


LONSO 




MARLON 


MATTYE 


MELONY 


MIKE 


LEAH 


LEONIDAS


LINCOLN 


LONZIE 




MARMADUKE 


MAUD 


MELVA 


MIKEAL 


LEANDER 


LEONORA 


LINDA 


LONZO 




MARNEY 


MAUDE 


MELVIN 


MILAN 


LEANE 


LEOPOLD 


LINDSAY 


LONZY 


MARNIE 


MAURA 


MELVYN 


MILDRED 


LEANN 


LERA 


LINDSEY 


LORA 




MARNY 


MAUREEN 


MELYNDA 


MILES 


LEANNE 


LEROY 


LINK 


LORAIN 




MARSDEN 


MAURENE 


MENDEL 


MIUCENT 


LEATHA 


LES 


LINNEA


LORAINE 




MARSHA 


MAUREY 


MERCEDEL 


MILLARD 


LEE 


LESLIE 


LINNIE 


LORANE 




MARSHAL 


MAURICE 


MERCEDES 


MLLUCENT 


LEEANN 


LESTER 


UNNY 


LORAY 




MARSHALL 


MAURIE 


MERCY 


Mil I IP 


LEEANNE 


LETA 


LINROY 


LORAYNE 


MARTA 


MAURINE


MEREDETH 


MILLY 


LEENA 


LETHA 


LINUS


LOREEN 




MARTEA 


MAURY 


MEREDITH 


MILO 


LEESA 


LETICIA 


UNVAL 


LOREN 




MARTHA 


MAVIS 


MERIDETH 


MILT 


LEFFEL 


LETITIA


LINWOOD 


LORENA 




MARTI 


MAX 


MERIDITH 


MILTON 


LEFTY 


LETTIE 


UNZIE 


LORENE 




MARTICA 


MAXCIE 


MERESSA 


MIMI 


LEIF 


LETTY 


UNZY 


LORETA 




MARTLE 


MAXCINE 


MERLE 


MINDY 


LEIGH 


LE VERNE 


LIONEL 


LORETTA 




MARTIKA


MAXIE 


MERLIN


MINERVA 


LEILA 


LEVERT 


LISA


LORI 




MARTI LDA 


MAXIM 


MERLYN 


MINNIE


MARTIN


MAXI- 


MERREL 


MIRANDA 




ORVILLE 


PAULETTE 


PIA 


RAPHAEL 




MILIAN 








OSBERT 


PAULINA 


PEER 


RAQUEL 


MARTY 


MAXI- 


MERRELL 


MIRIAM 




OSCAR 


PAULINE 


PIERCE 


RAY 




MILLIAN 








OSGOOD 


PAVEL 


PIERINA 


RAYFORD 


MARV 


MAXINE 


MERRILL 


MISSIE 




OSIE 


PEARCE 


PIERRE 


RAYMOND 


MARVA 


MAXWELL 


MERRY 


MISSY 




OSSIE 


PEARL 


PIERS 


RBT 


MARVIN 


MAY 


MERVIN 


MISTY 




OSWALD 


PEARLENE 


PIETER 


REAGAN 


MARY 


MAYBELLE 


MERWIN 


MITCH 




OTHA 


PEARLIE 


PIETRO 


REBA 


MARYAM 


MAYDA 


MERWYN 


MITCHEL




arms 


PEARLINE


PLATO 


REBECCA 


MARYANN 


MAYME 


MERYL 


MITCHELL 




ons 


PEARLY 


POINDEXTER 


RECTOR 


MARYANNE 


MAYNARD 


META 


MITZI




OTTTS 


PEDER 


POLUE 


REED 


MOE 


NAMIE 


NEWEL 


NORA 


OTTO 


PEG 


POLLY 


REGAN 


MOLLIE 


NAN 


NEWELL 


NORAH 




OVE 


PEGOI 


PORTIA 


REGGIE 


MOLLY 


NANCI 


NEWT 


NORBERT 




OVETA 


PEGGIE 


PRECY 


REGGY 


MONA 


NANCIE


NEWTON 


NOREEN 




OWEN 


PEGGY 


PRESTON 


REGINA 


MONICA 


NANCY 


NICHAEL 


NORM 




OZZIE 


PENELOPE 


PRINCE 


REGINALD 


MONIQUE 


NANETTE 


NICHOLM 


NORMA 




PADDIE


PENNI


PRINCESS 


REGIS 


MONTE 


NAM 


NICHOLAS 


NORMAL 


PADDY 


PENNIE 


PRISCILLA 


REED 


MONTGOMERY 


NANNETTE 


NICK 


NORMAN 




REINHARDT 


RIPLEY 


RONNETTE 


ROXIE 


MONTY 


NANNI 


NICKI 


NORRIS 




REINHOLD 


RISA 


RONNIE 


ROXY 


MONY 


NANNIE 


NICKIE 


NORTON 




REMI 


RITA 


RONNY 


ROY 


MORDECAI 


NANNY 


NICKODEMUS 


NORVAL 




REMO 


RITCHIE 


ROOSEVELT 


ROYAL 


MOREY 


NAOMA 


NICKY 


NUNZIO 




REMUS 


RTTHA 


RORY 


ROYCE 


MORGAN 


NAOMI 


NICODEMO 


NYLE 




RENA 


ROB 


ROSALEE 


ROZAIJA 


MORGANA 


NAPOLEON 


NICODEMUS 


OBADIAH 




RENATA 


ROBBI 


ROSALIA 


ROZAUE 


MORRIS 


NAT 


NICOL 


OBED 




RENATE 


ROBBIE 


ROSALIE 


RUBAN 


MOOT 


NATALIE 


NICOLA 


OBEDIAH 




RENE 


ROBBIN


ROSALIND




MORTIMER 


NATASHA 


NICOLM 


OCIE 




RENEE 


ROBBY


ROSALINDA 


RUBENA 


MORTON 


NATASSIA 


NICOLAS 


OCTAVE 




RETA 


ROBERT 


ROSALYN 


RUBERT 


MORTY 


NATHAN 


NICOLE 


OCTAVIA 




RETHA 


ROBERTA 


ROSALYND 


RUBEY 


MOSE 


NATHANIEL


NICOLETTE 


ODEL 




REUBAN 


ROBIN 


ROSAMOND 


RUBIE 


MOSES 


NATWICK 


NICOLLE 


ODELL 




REUBEN 


ROBINA 


ROSAMONDE 


RUBIN 


MOZELLE 


NAZARETH 


NICOLO 


ODESSA 




REUBEN A 


ROBINETTE 


ROSAMUND 


RUBINA 


MULLIGAN 


NEAL 


NIGEL 


ODE 




REUBIN


ROBYN 


ROSAMUNDE 


RUBY 


MURIEL 


NEALY 


NIKA 


ODIS 




REUBINA 


ROCCO 


ROSANNA 


RUBYE 


MURPHY 


NED 


NIKE 


OGDEN 




REVA 


ROCHELLE 


ROSANNE 


RUDDY 


MURRAY 


NEDINE 


NIKI


OKTAVIA 


REX 


ROCKY 


ROSCOE 


RUDOLF 


MURRELL 


NEIL 


NIKITA


OLA 




REXFORD 


ROD 


ROSE 


RUDOLPH 


MURRY 


NEILL 


NIKITAS


OLAF 




REY 


RODDIE 


ROSEANN 


RUDY 


MYLES 


NELDA 


NIKKI 


OLAN 




REYNALD 


RODERIC 


ROSEANNE 


RUE 


MYNA 


NELL 


NILE 


OLEO 




REYNOLD 


RODERICH 


ROSEBUD 


RUEBEN 


MYRA 


NELLE 


NILES


OLEN 




RHEA 


RODERICK 


ROSELIN 


RUFUS 


MYRAH 


NELLIE 


NILS 


OLGA 


RHI ICELANDER 


RODGER 


ROSELYN 


RULOEF 


MYRAL 


NELLY 


NIMROD


OLIN 




RHODA 


RODNEY 


ROSEMARIE 


RUPERT 


MYREN 


NELS 


NINA 


OLIVE




RHONA 


RODRICK 


ROSEMARY 


RUSS 


MYRNA 


NELSON 


NOAH 


OLIVER




RHONDA 


ROGER 


ROSEMUND 


RUSSELL 


MYRON 


NENA 


NOE 


OLIVIA 




RHYS 


ROLAND 


ROSEMUNDE 


RUSTY 


MYRTLE 


NERO 


NOEL 


OLLEN 




RICARD 


ROLF 


ROSENA 


RUTH 


NADENE 


NERSES 


NOLA 


OLLIE 




RICH 


ROLFE 


ROSETTA 


RUTHANNA 


NADIA 


NESTOR 


NOLAN 


OLOF 




RICHARD 


ROLLO 


ROSIE 


RUTHANNE 


NADINE 


NETTIE 


NONA 


OMER 




RICHELLE 


ROMAN 


ROSINA 


RUTHIE 


NADJA 


NEVILLE 


NONNIE 


ONAL 




RICHIE 


ROMEO 


ROSLYN 


RUTHLYN 


ONEL 


PAGE


PENNY


PRUDENCE




RICK


ROMULUS


ROSS


RYAN 


ONNIK 


PAIGE 


PER 


PRUE 




RICKEY 


RON 


ROSWELL 


SABA 


OPAL 


PAM 


PERCIVAL 


PRUNELLA 




RICKI 


RONA 


ROSWITHA 


SABINA 


OPEL 


PAMELA 


PERCY 


QUEEN 




RICKIE 


RONALD 


ROULETTE 


SABINE 


OPHELIA 


PANSY 


PERRY 


QUEENIE 




RICKY 


RONDA 


ROWENA 


SABRINA


OPRAH 


PAOLO 


PERSIS 


QUEENY 




RIKKI 


RONDELL 


ROWLAND 


SADIE 


ORA 


PARL 


PETE 


QUENTIN




RILEY 


RONETTE 


ROXANNE 


SAL 


ORAL 


PARNELL 


PETER 


QUINCY 




SALU 


SHARI 


SIDNEY 


STACIE 


ORAN 


PARRY 


PETRA 


QUINNIE 




SALLIE 


SHARLENE 


SIEGFRIED 


STACY 


OREN 


PASCAL 


PETRO 


RACHEL 


SALLY 


SHARON 


SIG 


STAN 


ORESTE 


PASCHAL 


PETROS 


RACHELLE 




SALLYE 


SHARYN 


SIGFRIED 


STANISLAW 


ORIN 


PASQUALE 


PHEBE 


RAE 




SALOMON 


SHAUN 


SIGMUND 


STANLEY 


ORLAN 


PAT 


PHEUM 


RAIFORD 




SAM 


SHAWN 


SIGNE 


STANLY 


ORLEN 


PATIENCE 


PHIDIAS 


RALPH 




SAMANTHA 


SHAWNA 


SIGURD 


STEFAN 


ORLIN 


PATRICE 


PHIL 


RAMONA 




SAMMIE 


SHAYNE 


SILAS 


STEFFI 


ORLYN 


PATRICIA 


PHILBERT 


RANDAL 


SAMMY 


SHEARL 


SILVIA 


STELLA 


ORPHA 


PATRICK 


PHILIP 


RANDALL 




SAMUAL 


SHEBA 


SILVIO 


STEPHAN 


ORSON 


PATSY 


PHILIPPE 


RANDI 




SAMUEL 


SHEENA 


SIMEON 


STEPHANIE 


ORTON 


PATTI 


PHILLIP 


RANDIE 




SANDI 


SHEILA 


SIMON 


STEPHEN 


ORTRUD 


PATTIE


PHILO 


RANDOLF 




SANDIE 


SHELBY 


SIMONE 


STERLING 


ORVAL 


PATTY 


PHINEAS


RANDOLPH 




SANDRA 


SHELDEN 


SEMONNE 


STEVE 


ORVID 


PAUL 


PHOEBE 


RANDY 




SANDY 


SHELDON 


SISSIE 


STEVEN 


ORVIL 


PAULA 


PHYLLIS


RANSOM 




SANJA 


SHELIA 


SISSY 


STEVIE

SARA

SARAH 

SAUL 

SCHUYLER 

SCOT 

SCOTT 

SCOTTIE

SCOTTY 

SEAMUS 

SEAN 

SEBASTIAN 

SEFERINO 

SELDEN 

SELENA 

SELENE 

SELINA 

SELMA 

SELMER 

SERENA 

SERINA 

SETH 

SEYMORE 

SEYMOUR 

SHANE 

SHANNON 

SY 

SYBIL 
SYD 

SYDNEY 

SYLVAN 

SYLVANUS 

SYLVENE 

SYLVESTER 

SYLVIA 

SYLVIE

SYLVINA 

TABATHA 

TABITHA 

TAD 

TAFFY 

TALLULAH 

TAMARA 

TAMI 

TAMMI 

TAMMIE 

TAMMY 

TANCRED 

TAMA 

TANJA 

TANYA 

TASHA 

TATE 

TATIANA 



TAUBA 

TED 

TEDD 

TEDDY 

TEE 

TELMA 

TENA 

TENCH 

TERENCE 

TERESA 

TERRANCE 

TERRELL 

TERRENCE 

VANDY 

VANESSA

VAUGHAN 

VAUGHN 

VEDA 

VELLA 

VELMA 

VELVA 

VEOLA 



SHELLEY 

SHELLY 

SHELTON 

SHERI 

SHERIDAN 

SHERIE 

SHERILYN 

SHERL 

SHERLE 

SHERLEE 

SHERMAN 

SHERON 

SHERREE 

SHERRI 

SHERRIE 

SHERRY 

SHERWIN 

SHERYL 

SHIRL 

SHIRLE 

SHIRLEE 

SHIRLEY 

SI 

SIBYL 

SID 

TERRI 

TERRILL 

TERRY 

TERRYL 

TESS 

TESSIE 

TESSY 

TEX 

THAD 

THADDEUS 

THADEUS 

THARON 

THEA 

THEDA 

THE IMA 

THELMA 

THEO 

THEOBALD 

THEODIS 

THEODOR 

THEODORA 

THEODORE 

THEODORIS 

THEODOSIA 

THEONE 

THEORA 

THEOnS 

THERESA

THERESIA 

THERESSA 

THOM 

THOMAS 

THOMASINA 

THOR 

THORA 

THORE 

THORNTON 

THORVALD 

THOS 

THURMAN 

THURMOND 

VINCENT 

VINNIE 

VINNY 

VIOLA 

VIOLET 

VIRACE 

VIRDA 

VIRGIE 

VIRGIL 



SKEET 

SKIPPIE 

SKIPPY 

SKYLER 

SLIM 

SMEDLEY 

SOFIE 

SOL 

SOLOMAN 

SOLOMON 

SONDRA 

SONIA 

SONJA 

SONNY 

SONYA 

SPARKY 

SPENCE 

SPENCER 

SPENSER 

SPIRO 

SPIROS 

SPYROS 

STACEY 

STACI 

STACIA 

THURSTAN 

THURSTON 

TIBOR 

TIFFANY 

TILLIE 

TILLY 

TIM 

TIMMY 

TIMO 

TIMOTHY 

TINA 

TINO 

TIPHANIE 

TIPHANY 

TITO 

TITUS 

TOBIAS 

TOBY 

TODD 

TOLLIE 

TOLLIVER

TOLLY 

TOM 

TOMMIE 

TOMMY 

TONEY 

TOM 

TOMA 

TONY 

TONYA 

TOOTTE 

TOVE 

TRACEE 

TRACEY 

TRACI 

TRACIE 

TRACY 

TRAVER 

TRAVIS 

TRENA 

TRENT 

WAYMAN 

WAYNE 

WELDEN 

WELDON 

WELLS 

WENDEL 

WENDELL 

WENDI 

WENDY 



STEWART 
STU 

STUART 
SUANN 
SUANNE 
SUE 

SUELLEN 

SUMNER 

SUNNIE 

SUNNY 

SUSAN 

SUSANA 

SUSANNA 

SUSANNAH 

SUSANNE 

SUSETTE 

SUZAN 

SUZANNE 

SUZELLEN 

SUZETTE 

SUZI 

SUZIE 

SUZY 

SVEN 

SWEN 

TREVOR 

TREY 

TRICIA 

TRILBY 

TRENA 

TRISH 

TRISTAM 

TRISTAN 

TREXIE 

TREXY 

TROY 

TRUDIE 

TRUDY 

TWALA 

TWILA 

TWYLA 

TYCHO 

TYCHUS 

TYCUS 

TYRONE 

UDE 

UDY 

ULRICH 

ULYSSES 

UNA 

URA 

URBAIN 

URIAS 



URSULA 

VACHEL 

VADA 

VAL 

VALDA 

VALENTIN 

VALENTINE 

VALENTINO 

VALERIA 

VALERIE 

VALTER 

VANCE 

VANDER 

WINNIFRED 

WINNY 

WINONA 

WINSLOW 

WINSTON 

WINTHROP 

WINTON 

WM 

WOLFGANG 






List E 

Example of ENGLISH FIRST-NAME LIST 
Copyright 1995 
LEXIS-NEXIS, a Division of Reed Elsevier Inc. 



VERA 


VIRGINIA 


WERNER 


WOODIE 


VERDEEN 


VIRGIMUS 


WES 


WOODROW


VERDELL 


VITA 


WESLEY 


WOODRUFF 


VERGA 


vrnNA 


WESLIE 


WOODY 


VERGIL 


VTITNO 


WESTLEY 


WYATT 


VERLIE 


VITO


WHITNEY 


WYLA 


VERLIN 


VnTORIA 


WIDDIE 


XAVIER 


VERLON 


VIVIAN 


WILBER 


XAVIERA 


VERLYN 


VIVIANNE 


WILBERFORCE 


YARDLEY 


VERN 


VIVIEN 


WILBERT 


YETTA 


VERNA 


VLAD 


WILBUR 


YOLANDA 


VERNARD 


VLADIMIR 


WILDA 


YOSEF 


VERNE 


VOL 


WILEY 


YVES 


VERNEST 


VON 


WILFORD 


YVETTE


VERNESTINE 


VONDA 


WILFRED 


YVONNE 


VERNICE


VONNA 


WILHELM 


ZACHARIA 


VERNIE 


WADE 


WILHELMENA 


ZACHARIAH 


VERNON 


WALDEMAR 


WILHELMINA 


ZACHARY 


VERONICA 


WALDO 


WILHEMENA 


ZACK 


VERSA 


WALDORF 


WILHEMINA 


ZALPH 


VERSIE 


WALLACE 


WILLARD 


ZANE 


VI 


WALLIE 


WILLIAM 


ZEB 


VIC 


WALLY 


WILLIE


ZEBADIAH


VICKI 


WALT 


WILLIS 


ZEBEDEE 


VICKIE 


WALTER 


WILLMA 


ZECHARIAH 


VICKY 


WANDA 


WILLY 


ZEF 


VICTOR 


WARD 


WILMA 


ZEFF 


VICTORIA 


WARREN 


WILMAR 


ZEKE 


VIDAL 


WASHING- 


WILMOT 


ZELDA 




TON 






VIE 


WAYLAN 


WINFRED 


ZELIA 


VILMA 


WAYLEN 


WINIFRED 


ZELIG 


VINCE 


WAYLON 


WINNIE 


ZELL 



ZELLA 
ZELLE 
ZELMA 
ZENA 
ZENITH 
ZENO 
ZENOBIA 
ZENON 
ZEPHERY 
ZETA 
ZETTA 
ZEV 
ZILLA
ZILLAH 
ZINA 
ZITA
ZIVKO
ZOE 
ZOLLIE 
ZOLLY 



ZORA 
ZULA

ZYGMUND 
ZYGMUNT 



Modifications and variations of the above-described
embodiments of the present invention are possible, as
appreciated by those skilled in the art in light of the above
teachings. As mentioned, any of a variety of hardware
systems, memory organizations, software platforms, and
programming languages may embody the present invention
without departing from its spirit and scope. Moreover,
countless variations of the Partition List, company
indicators, product names, organization indicators, English
first name list, and resulting Phrase Lists, and the like, may
be employed or produced while remaining within the scope
of the invention. It is therefore to be understood that, within
the scope of the appended claims and their equivalents, the
invention may be practiced otherwise than as specifically
described.

What is claimed is:

1. A computer-implemented method of processing a 
stream of document text to form a list of phrases that are 
indicative of conceptual content of the document, the 
phrases being used as index terms and search query terms in 
full text document searching performed after the phrase list 
is formed, the method comprising: 

partitioning the document text into plural chunks of 
document text, each chunk being separated by at least 
one partition entity from a partition list; and 

selecting certain chunks as the phrases of the phrase list, 
based on frequencies of occurrence of the chunks 
within the stream of document text. 
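The two steps of claim 1 can be illustrated with a minimal Python sketch. The partition list contents, the function names `partition` and `select_phrases`, and the `min_count` threshold are assumptions made for illustration only; they are not taken from the specification.

```python
import re
from collections import Counter

# Hypothetical partition list; the real list is defined elsewhere in the specification.
PARTITION_LIST = {"the", "a", "an", "of", "in", "is", "are", "and", "was", "by", "to"}

def partition(text, partition_list=PARTITION_LIST):
    """Split text into chunks wherever a partition entity or punctuation occurs."""
    chunks, current = [], []
    for token in re.findall(r"[A-Za-z0-9'-]+|[^\sA-Za-z0-9]", text):
        if token.lower() in partition_list or not token[0].isalnum():
            if current:                      # a partition entity closes the open chunk
                chunks.append(" ".join(current))
                current = []
        else:
            current.append(token)
    if current:
        chunks.append(" ".join(current))
    return chunks

def select_phrases(chunks, min_count=2):
    """Keep chunks whose frequency of occurrence reaches min_count (assumed threshold)."""
    counts = Counter(chunks)
    return [chunk for chunk, n in counts.items() if n >= min_count]

text = ("The phrase list is formed by the indexing engine. "
        "The indexing engine scans the phrase list.")
phrases = select_phrases(partition(text))
```

Here only "phrase list" recurs as an identical chunk, so it is the single phrase selected.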

2. The method of claim 1, wherein the partitioning step 
includes: 

scanning a portion of the document text stream;

comparing the scanned portion of the document text
stream to partition entities in the partition list;

substituting a partition tag for portions of the document
text stream which match a partition entity;

generating a text chunk list;

scanning the text chunk list to determine a frequency of 
each text chunk in the text chunk list; and 

revising the text chunk list to include the respective 
frequencies of occurrence in association with the text 
chunks. 
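As a rough sketch of the partitioning steps recited in claim 2, the following Python replaces matched partition entities with a tag and then builds a chunk list annotated with frequencies. The tag string `<P>`, the punctuation set, and both function names are hypothetical choices, not details from the specification.

```python
import re
from collections import Counter

PARTITION_TAG = "<P>"  # hypothetical tag value

def tag_partitions(text, partition_list):
    """Substitute the partition tag for every match of a partition entity."""
    # Longest entities first so multi-word entities win over shorter ones.
    entities = sorted(partition_list, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(e) for e in entities) + r")\b|[.,;:!?()]"
    return re.sub(pattern, PARTITION_TAG, text, flags=re.IGNORECASE)

def chunk_list_with_frequencies(tagged):
    """Generate the text chunk list and attach each chunk's frequency."""
    chunks = [" ".join(piece.split()) for piece in tagged.split(PARTITION_TAG)]
    return Counter(chunk for chunk in chunks if chunk)

tagged = tag_partitions(
    "The trial court ruled. The trial court erred in the ruling.",
    {"the", "in", "ruled", "erred"},
)
frequencies = chunk_list_with_frequencies(tagged)
```

Using a `Counter` merges the "scan to determine frequency" and "revise the list to include frequencies" steps into one pass; a literal implementation could keep them separate.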

3. The method of claim 1, wherein the selecting step 
includes: 

selecting the certain chunks as the phrases of the phrase 
list based only on the frequencies of occurrence of the 
chunks within the stream of document text and on a 
quantity of words within the chunks. 

4. The method of claim 1, wherein: 

a) the partitioning step includes: 

a1) scanning a portion of the document text stream;

a2) comparing the scanned portion of the document text 
stream to partition entities in the partition list; 

a3) substituting a partition tag for portions of the 
document text stream which match a partition entity; 

a4) generating a text chunk list; 

a5) scanning the text chunk list to determine a
frequency of each text chunk in the text chunk list; and

a6) revising the text chunk list to include the respective 
frequencies of occurrence in association with the text 
chunks; and 

b) the selecting step includes selecting the certain chunks 
as the phrases of the phrase list based only on the 
frequencies of occurrence of the chunks within the
stream of document text and on a quantity of words 
within the chunks. 

5. The method of claim 1, wherein the selecting step 
includes: 

excluding a chunk from being determined as a phrase if 
the chunk is a single word beginning with a lower case 
letter. 

6. The method of claim 1, wherein the selecting step 
includes: 

determining a chunk as being a phrase if the chunk 
includes a plurality of words each constituting lower 
case letters only if the chunk occurs at least twice in the 
document text stream. 
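The lower-case rules of claims 5 and 6 can be sketched as a single predicate. The function name and the decision to return False for capitalized chunks (leaving them to separate proper-name handling) are assumptions for illustration.

```python
def is_lowercase_phrase(chunk, count):
    """Apply the lower-case selection rules to one chunk and its frequency."""
    words = chunk.split()
    if len(words) == 1 and words[0][:1].islower():
        return False              # claim 5: a single lower-case word is never a phrase
    if len(words) > 1 and all(word.islower() for word in words):
        return count >= 2         # claim 6: lower-case multi-word chunk must occur twice
    return False                  # capitalized chunks are assumed to be handled elsewhere

decisions = {
    "search": is_lowercase_phrase("search", 5),
    "full text search": is_lowercase_phrase("full text search", 2),
}
```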

7. The method of claim 1, wherein the selecting step 
includes: 

determining a chunk as being a proper name if the chunk 
includes a plurality of words each having at least a first 
letter which is upper case.

8. The method of claim 1, wherein the selecting step
includes:

mapping a sub-phrase to a phrase. 

9. The method of claim 1, wherein the selecting step 
includes: 

mapping single upper case words to their respective 
proper names. 
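One possible reading of claims 7 and 9 is sketched below: multi-word chunks whose words all start with an upper-case letter are treated as proper names, and the count of a single capitalized word is folded into a proper name that contains it. The tie-breaking policy (first matching name wins) and the function names are assumptions; the claims do not specify them.

```python
def is_proper_name(chunk):
    """Claim 7: a plurality of words, each with an upper-case first letter."""
    words = chunk.split()
    return len(words) > 1 and all(word[0].isupper() for word in words)

def map_single_words(counts):
    """Claim 9 (assumed policy): credit single upper-case words to a containing name."""
    names = {chunk: n for chunk, n in counts.items() if is_proper_name(chunk)}
    for chunk, n in counts.items():
        if " " not in chunk and chunk[:1].isupper():
            for name in names:
                if chunk in name.split():
                    names[name] += n   # first matching proper name receives the count
                    break
    return names

name_counts = map_single_words({"Reed Elsevier": 2, "Elsevier": 3, "phrase list": 4})
```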

10. The method of claim 1, wherein the selecting step 
includes: 

detecting presence of acronyms;

incrementing a count of a proper name corresponding to
the respective detected acronyms; and

copying the proper name and the acronym to an acronym
list.
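The acronym handling of claim 10 might look like the following sketch, which assumes an acronym is formed from the initial letters of a proper name's capitalized words. That matching rule and the names `acronym_of` and `fold_acronyms` are illustrative assumptions.

```python
def acronym_of(name):
    """Build a candidate acronym from the capitalized words of a proper name."""
    return "".join(word[0] for word in name.split() if word[0].isupper())

def fold_acronyms(name_counts, chunk_counts):
    """Add acronym occurrences to the proper name's count and record the pairing."""
    counts = dict(name_counts)
    acronym_list = []
    for name in name_counts:
        acronym = acronym_of(name)
        if len(acronym) > 1 and acronym in chunk_counts:
            counts[name] += chunk_counts[acronym]    # increment the proper name's count
            acronym_list.append((name, acronym))     # copy the pair to the acronym list
    return counts, acronym_list

counts, acronym_list = fold_acronyms(
    {"Federal Bureau of Investigation": 1},
    {"FBI": 3, "agents": 2},
)
```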

11. The method of claim 1, wherein the selecting step 
includes: 

combining a phrase list of lower case words with a phrase 
list of proper names. 

12. The method of claim 1, further comprising:

reducing the phrase list by consolidating phrases in the
phrase list by using a synonym thesaurus.
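A minimal sketch of the consolidation in claim 12, assuming the synonym thesaurus can be represented as a mapping from a variant phrase to its canonical form (the mapping shape and the sample entries are hypothetical):

```python
def consolidate(phrase_counts, thesaurus):
    """Merge phrases that the thesaurus maps to the same canonical phrase."""
    merged = {}
    for phrase, n in phrase_counts.items():
        canonical = thesaurus.get(phrase, phrase)   # unmapped phrases stay as they are
        merged[canonical] = merged.get(canonical, 0) + n
    return merged

reduced = consolidate(
    {"attorney fees": 2, "lawyer fees": 1, "trial court": 3},
    {"lawyer fees": "attorney fees"},
)
```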

13. The method of claim 1, further comprising:

adding phrases to the phrase list by combining phrases
which are separated in the document text stream only
by prepositions.
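Claim 13 can be illustrated with a simplified sketch that joins two known phrases when exactly one preposition separates them in the token stream. The preposition set, the single-preposition restriction, and the function name are assumptions; the claim itself allows phrases separated "only by prepositions" without fixing how many.

```python
PREPOSITIONS = {"of", "in", "for", "on", "by", "with", "to"}  # hypothetical list

def combine_across_prepositions(tokens, phrases):
    """Return new phrases of the form '<phrase> <preposition> <phrase>'."""
    combined = set()
    for i, token in enumerate(tokens):
        if token.lower() not in PREPOSITIONS:
            continue
        for left in phrases:
            left_words = left.split()
            if i < len(left_words) or tokens[i - len(left_words):i] != left_words:
                continue                      # no known phrase ends right before the preposition
            for right in phrases:
                right_words = right.split()
                if tokens[i + 1:i + 1 + len(right_words)] == right_words:
                    combined.add(" ".join(left_words + [token] + right_words))
    return combined

added = combine_across_prepositions(
    "the statute of limitations expired".split(),
    {"statute", "limitations"},
)
```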

14. The method of claim 1, further comprising:

trimming the phrase list by eliminating phrases which
occur in fewer than a threshold number of document
text streams.
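The trimming step of claim 14 amounts to a document-frequency filter. The sketch below assumes each document has already been reduced to its own list of phrases; the function name and sample data are illustrative.

```python
from collections import Counter

def trim_by_document_frequency(per_document_phrases, threshold):
    """Keep phrases that occur in at least `threshold` document text streams."""
    document_frequency = Counter()
    for phrases in per_document_phrases:
        document_frequency.update(set(phrases))   # count each document at most once
    return {p for p, df in document_frequency.items() if df >= threshold}

kept = trim_by_document_frequency(
    [["trial court", "phrase list"],
     ["trial court"],
     ["trial court", "index term", "index term"]],
    threshold=2,
)
```

Note that "index term" appears twice but only in one document, so it is eliminated.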

15. The method of claim 1, further comprising:

categorizing proper names in the proper name list into
groups based on corresponding group lists.
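Claim 15 could be sketched as a lookup against group lists such as the company indicators and the English first-name list mentioned in the description. The tiny lists below and the rule of checking the last word for a company indicator and the first word for a first name are assumptions for illustration.

```python
COMPANY_INDICATORS = {"INC", "CORP", "CO", "LTD"}   # hypothetical excerpt
FIRST_NAMES = {"ZELDA", "WILLIAM", "MARY"}          # hypothetical excerpt of the first-name list

def categorize(name):
    """Assign a proper name to a group based on the group lists."""
    words = [word.strip(".,").upper() for word in name.split()]
    if words and words[-1] in COMPANY_INDICATORS:
        return "company"
    if words and words[0] in FIRST_NAMES:
        return "person"
    return "other"

groups = {name: categorize(name)
          for name in ["Reed Elsevier Inc.", "Zelda Fitzgerald", "Supreme Court"]}
```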

16. An apparatus of processing a stream of document text 
to form a list of phrases that are indicative of conceptual 
content of the document, the phrases being used as index 
terms and search query terms in full text document searching 
performed after the phrase list is formed, the apparatus 
comprising: 

means for partitioning the document text into plural 
chunks of document text, each chunk being separated 
by at least one partition entity from a partition list; and 

means for selecting certain chunks as the phrases of the 
phrase list, based on frequencies of occurrence of the 
chunks within the stream of document text. 

17. The apparatus of claim 16, wherein the partitioning 
means includes:

means for scanning a portion of the document text stream;

means for comparing the scanned portion of the document
text stream to partition entities in the partition list;

means for substituting a partition tag for portions of the
document text stream which match a partition entity;

means for generating a text chunk list;

means for scanning the text chunk list to determine a
frequency of each text chunk in the text chunk list; and

means for revising the text chunk list to include the
respective frequencies of occurrence in association
with the text chunks.

18. The apparatus of claim 16, wherein the selecting 
means includes: 

means for selecting the certain chunks as the phrases of
the phrase list based only on the frequencies of occurrence
of the chunks within the stream of document text
and on a quantity of words within the chunks.

19. The apparatus of claim 16, wherein:

a) the partitioning means includes: 

a1) means for scanning a portion of the document text
stream;

a2) means for comparing the scanned portion of the 
document text stream to partition entities in the 
partition list; 

a3) means for substituting a partition tag for portions of 
the document text stream which match a partition 
entity; 

a4) means for generating a text chunk list; 

a5) means for scanning the text chunk list to determine
a frequency of each text chunk in the text chunk list;
and

a6) means for revising the text chunk list to include the 
respective frequencies of occurrence in association 
with the text chunks; and 

b) the selecting means includes means for selecting the 
certain chunks as the phrases of the phrase list based 
only on the frequencies of occurrence of the chunks 
within the stream of document text and on a quantity of 
words within the chunks. 

20. A computer-readable memory which, when used in
conjunction with a computer, can carry out a phrase recog- 
nition method to form a phrase list containing phrases that 
are indicative of conceptual content of a document, the 
phrases being used as index terms and search query terms in
full-text document searching performed after the phrase list 
is formed, the computer-readable memory comprising: 

computer-readable code for partitioning document text 
into plural chunks of document text, each chunk being 
separated by at least one partition entity from a parti- 
tion list; and 

computer-readable code for selecting certain chunks as 
the phrases of the phrase list based on frequencies of 
occurrence of the chunks within the stream of docu- 
ment text. 

21. The computer-readable memory of claim 20, wherein 
the computer-readable code for partitioning includes: 

computer-readable code for scanning a portion of the
document text stream;

computer-readable code for comparing the scanned portion
of the document text stream to partition entities in
the partition list;

computer-readable code for substituting a partition tag for
portions of the document text stream which match a
partition entity;

computer-readable code for generating a text chunk list;

computer-readable code for scanning the text chunk list to
determine a frequency of each text chunk in the text
chunk list; and

computer-readable code for revising the text chunk list to
include the respective frequencies of occurrence in
association with the text chunks.

22. The computer-readable memory of claim 20, wherein 
the computer-readable code for selecting includes: 

computer-readable code for selecting the certain chunks 
as the phrases of the phrase list based only on the 
frequencies of occurrence of the chunks within the 
stream of document text and on a quantity of words 
within the chunks. 

23. The computer-readable memory of claim 20, wherein: 
a) the computer-readable code for partitioning includes: 

a1) computer-readable code for scanning a portion of
the document text stream;

a2) computer-readable code for comparing the scanned
portion of the document text stream to partition
entities in the partition list;

a3) computer-readable code for substituting a partition
tag for portions of the document text stream which
match a partition entity;

a4) computer-readable code for generating a text chunk
list;

a5) computer-readable code for scanning the text chunk 
list to determine a frequency of each text chunk in 
the text chunk list; and 
a6) computer-readable code for revising the text chunk 
list to include the respective frequencies of occur- 
rence in association with the text chunks; and 
b) the computer-readable code for selecting includes 
computer-readable code for selecting the certain 
chunks as the phrases of the phrase list based only on 
the frequencies of occurrence of the chunks within the 
stream of document text and on a quantity of words 
within the chunks. 

24. A computer-implemented method of full-text, on-line 
searching, the method comprising: 

a) receiving and executing a search query to display at 
least one current document; 

b) receiving a command to search for documents having 
similar conceptual content to the current document; 

c) executing a phrase recognition process to extract 
phrases allowing full text searches for documents hav- 
ing similar conceptual content to the current document, 
the phrase recognition process including the steps of: 
c1) partitioning the document text into plural chunks of
document text, each chunk being separated by at
least one partition entity from a partition list; and
c2) selecting certain chunks as the phrases, based on 
frequencies of occurrence of the chunks within the 
stream of document text; and 

d) automatically forming a second search query based at 
least on the phrases determined in the phrase recogni- 
tion process so as to allow automated searching for 
documents having similar conceptual content to the 
current document. 

25. The method of claim 24, further comprising: 
validating phrases recognized by the phrase recognition
process against phrases in a phrase dictionary before
automatically forming the second search query. 

26. The method of claim 24, further comprising: 

displaying an error message if less than a threshold
number of phrases are recognized for the current
document.

27. A computer-implemented method of forming a phrase 
list containing phrases that are indicative of conceptual 
content of each of a plurality of documents, which phrases 
are used as index terms or in document search queries 
formed after the phrase list is formed, the method compris- 
ing: 

a) selecting document text from the plurality of docu- 
ments; 

b) executing a phrase recognition process including the 
steps of: 

b1) partitioning the document text into plural chunks of
document text, each chunk being separated by at 
least one partition entity from a partition list; and 

b2) selecting certain chunks as the phrases, based on 
frequencies of occurrence of the chunks within the 
stream of document text; and

c) forming the phrase list, wherein the phrase list includes:

1) phrases extracted by the phrase recognition process; 
and 

2) respective frequencies of occurrence of the extracted 
phrases. 

28. The method of claim 27, further comprising:

forming a modified phrase list having only those phrases 
whose respective frequencies of occurrence are greater 
than a threshold number of occurrences. 

29. The method of claim 27, further comprising:

forming a phrase dictionary based on the phrase list 
formed in the forming step. 

30. A computer-implemented method of forming phrase 
lists containing phrases that are indicative of conceptual 
content of documents, which phrases are used as index terms
or in document search queries formed after the phrase list is
formed, the method comprising:

a) selecting document text from a sampling of documents 
from among a larger collection of documents; and 

b) executing a phrase recognition process to extract 
phrases to form a phrase list for each document 
processed, the phrase recognition process including: 
b1) partitioning the document text into plural chunks of
document text, each chunk being separated by at
least one partition entity from a partition list; and
b2) selecting certain chunks as the phrases of the phrase 
list based on frequencies of occurrence of the chunks 
within the stream of document text. 
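
The partitioning and selecting steps recited in the claims above can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the claimed implementation: the contents of `PARTITION_LIST`, the function names `partition` and `select_phrases`, and the thresholds `min_freq`, `min_words` and `max_words` are all hypothetical and do not appear in the claims.

```python
import re
from collections import Counter

# Hypothetical partition list: words that separate chunks (an assumption;
# the claims only require "a partition list" of partition entities).
PARTITION_LIST = {"the", "a", "an", "of", "in", "is", "are", "and", "to", "by", "for"}


def partition(text, partition_list=PARTITION_LIST):
    """Split text into chunks separated by at least one partition entity."""
    # Words/numbers become tokens; every punctuation mark is its own token.
    tokens = re.findall(r"[a-z0-9]+|[^\sa-z0-9]", text.lower())
    chunks, current = [], []
    for tok in tokens:
        # Listed words and punctuation both act as partition entities here.
        if tok in partition_list or not tok.isalnum():
            if current:
                chunks.append(" ".join(current))
                current = []
        else:
            current.append(tok)
    if current:
        chunks.append(" ".join(current))
    return chunks


def select_phrases(chunks, min_freq=2, min_words=2, max_words=4):
    """Keep chunks based only on their frequency and their word count."""
    freq = Counter(chunks)
    return {
        chunk: count
        for chunk, count in freq.items()
        if count >= min_freq and min_words <= len(chunk.split()) <= max_words
    }
```

Calling `select_phrases(partition(text))` returns a mapping from each retained phrase to its frequency of occurrence, which corresponds loosely to a phrase list that carries frequencies alongside the phrases.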

* * * * *

ABSTRACT



Scatter-Gather is a computer based document browsing 
method which operates in time proportional to a num- 
ber of documents in a target corpus. The Scatter-Gather 
method includes: preparing an initial ordering of the 
corpus using, for example, an off-line computational 
method; determining a summary of the initial ordering 
of the corpus for interactive utility; and providing a 
further ordering of the corpus using, for example, an 
on-line non-deterministic method. The step of an off- 
line preparation of an initial ordering of a corpus is 
non-time-dependent, thus an accurate initial ordering is 
prepared. The step of determining a summary includes 
determining a summary for presentation to a user with- 
out scrolling on a CRT. The step of providing a further 
ordering includes truncated group average agglomerate 
clustering, merging disjointed document sets, center 
finding, assign-to-nearest and other refinement methods.

SCATTER-GATHER: A CLUSTER-BASED
METHOD AND APPARATUS FOR BROWSING
LARGE DOCUMENT COLLECTIONS

BACKGROUND OF THE INVENTION

The present invention relates to a document-clustering-based
browsing procedure for a corpus of documents, which is applicable
over all natural languages that contain a lexical analysis
capability.

Document clustering has been extensively investigated as a
methodology for improving document search and retrieval. The
general assumption is that mutually similar documents will tend to
be relevant to the same queries and hence, automatic determination
of groups of such documents can improve recall by effectively
broadening a search request. Typically a fixed corpus of documents
is clustered either into an exhaustive partition, disjoint or
otherwise, or into a hierarchical tree structure. In the case of a
partition, queries are matched against clusters, and the contents
of some number of the best scoring clusters are returned as a
result, possibly sorted by score. In the case of a hierarchy,
queries are processed downward, always taking the highest scoring
branch, until some stopping condition is achieved. The subtree at
that point is then returned as a result.

Hybrid strategies are also available, which are essentially
variations of near-neighbor search, where nearness is defined in
terms of the pairwise document similarity measure used to generate
the clustering. Indeed, cluster search techniques are typically
compared to similarity search, a direct near-neighbor search, and
are evaluated in terms of precision and recall. Various studies
have indicated that cluster search strategies are not markedly
superior to similarity search and, in some situations, can be
inferior. It is therefore not surprising that cluster search,
given its indifferent performance, and the high determinable cost
of clustering large corpora, has not gained wide popularity.

Document clustering has also been studied as a method for
accelerating similarity search, but the development of fast
procedures for near-neighbor searching has decreased interest in
that possibility.

In order to cluster documents, one must first establish a pairwise
measure of document similarity and then define a method for using
that measure to form sets of similar documents, or clusters.
Numerous document similarity measures have been proposed, all of
which consider the degree of word overlap between the two
documents of interest, described as sets of words, often with
frequency information. These sets are typically represented as
sparse vectors of length equal to the number of unique words (or
types) in the corpus. If a word occurs in a document, its location
in this vector is occupied by some positive value (one if only
presence/absence information is considered, or some function of
its frequency within that document if frequency is considered). If
a word does not occur in a document, its location in this vector
is occupied by zero. A popular similarity measure, the cosine
measure, determines the cosine of the angle between these two
sparse vectors. If both document vectors are normalized to unit
length, this is of course, simply the inner product of the two
vectors. Other measures include the Dice and Jaccard coefficients,
which are normalized word overlap counts. It has also been
suggested that the choice of similarity measure has less
qualitative impact on clustering results than the choice of
clustering procedure.

A wide range of clustering procedures have been applied to
documents including, most prominently, single-linkage hierarchical
clustering. Hierarchical clustering procedures proceed by
iteratively considering all pairs of similarities, and fusing the
pair which exhibits the greatest similarity. They differ in the
procedure used to determine similarity when one of the pair is a
document group, i.e., the product of a previous fusion.
Single-linkage clustering defines the similarity as the maximum
similarity between any two individuals, one from each half of the
pair. Alternative methods consider the minimum similarity
(complete linkage), the average similarity (group average
linkage), as well as other aggregate measures. Although
single-linkage clustering is known to have an unfortunate chaining
behavior, typically forming elongated straggly clusters, it
continues to be popular due to its simplicity, and the
availability of an optimal space/time procedure for its
determination.

Standard hierarchical document clustering techniques employ a
document similarity measure and consider the similarities of all
pairs of documents in a given corpus. Typically, the most similar
pair is fused and the process iterated, after suitably extending
the similarity measure to operate on agglomerations of documents
as well as individual documents. The final output is a binary tree
structure that records the nested sequence of pairwise joins.
Traditionally, the resulting trees had been used to improve the
efficiency of standard boolean or relevance searches by grouping
together similar documents for rapid access. The resulting trees
have also led to the notion of cluster search in which a query is
matched directly against nodes in the cluster tree and the best
matching subtree is returned. Counting all pairs, the cost of
constructing the cluster trees can be no less than proportional to
N², where N is the number of documents in the corpus. Although
clustering experiments have been conducted on corpora with
documents numbering in the low tens of thousands, the intrinsic
order of these clustering procedures works against the expectation
that corpora will continue to increase in size. Similarly,
although cluster searching has shown some promising results, the
method tends to favor the most determinationally expensive
similarity measures and seldom yields greatly increased
performance over other standard methods.

Hierarchical methods are intrinsically quadratic in the number of
documents to be clustered, because all pairs of similarities must
be considered. This sharply limits their usefulness, even given
procedures that attain this theoretical upper bound on
performance. Partitional strategies (those that strive for a flat
decomposition of the collection into sets of documents rather than
a hierarchy of nested partitions) by contrast are typically
rectangular in the size of the partition and the number of
documents to be clustered. Generally, these procedures proceed by
choosing in some manner, a number of seeds equal to the desired
size (number of sets) of the final partition. Each document in the
collection is then assigned to the closest seed. As a refinement
the procedure can be iterated with, at each stage, a hopefully
improved selection of cluster seeds. However, to be useful for
cluster search the partition must be fairly fine, since it is
desirable for each set to only contain a few documents. For
example, a partition can be generated whose size is related to the
number of unique words in the document collection. From this
perspec-
tive, the potential determinable benefits of a partitional
strategy are largely obviated by the large size (relative to the
number of documents) of the required partition. For this reason
partitional strategies have not been aggressively pursued by the
information retrieval community.

The standard formulation of cluster search presumes a query, the
user's expression of an information need. The task is then to
search the collection of documents that match this need. However,
it is not difficult to imagine a situation in which it is hard, if
not impossible to formulate such a query. For example, the user
may not be familiar with the vocabulary appropriate for describing
a topic of interest, or may not wish to commit himself to a
particular choice of words. Indeed, the user may not be looking
for anything specific at all, but rather may wish to gain an
appreciation for the general information content of the
collection. It seems appropriate to describe this as browsing
rather than search, since it is at one extreme of a spectrum of
possible information access situations, ranging from requests for
specific documents to broad, open-ended questions with a variety
of possible answers. Standard information access techniques tend
to emphasize search. This is especially clearly seen in cluster
search where a technology capable of topic extraction, i.e.,
clustering, is submerged from view and used only as an assist for
near-neighbor searching.

In proposing an alternative application for clustering in
information access we take our inspiration from the access methods
typically provided with a conventional text book. If one has a
specific question in mind, and specific terms which define that
question, one consults an index, which directs one to passages of
interest, keyed by search words. However, if one is simply
interested in gaining an overview, one can turn to the table of
contents which lays out the logical structure of the text for
perusal. The table of contents gives one a sense of the types of
questions that might be answered if a more intensive examination
of the text were attempted, and may also lead to specific sections
of interest. One can easily alternate between browsing the table
of contents, and searching the index.

By direct analogy, an information access system is proposed
herein, which can have, for example, two components: a browsing
tool which uses a cluster-based, dynamic table-of-contents
metaphor for navigating a collection of documents; and one or more
word-based, directed text search tools, such as similarity search,
or the search technique described in U.S. patent application Ser.
No. 07/745,794 to Jan O. Pedersen et al. filed Aug. 16, 1991, and
entitled An Iterative Technique For Phrase Query Formation and an
Information Retrieval System Employing Same. The browsing tool
describes groups of similar documents, one or more of which can be
selected for further refinement. This selection/refinement process
can be iterated until the user is directly viewing individual
documents. Based on documents found in this process, or on terms
used to describe document groups, the user may at any time switch
to a more focused search method. In particular it is anticipated
that the browsing tool will not necessarily be used to find
particular documents, but may instead assist the user in
formulating a search request, which will then be evaluated by some
other means.

U.S. Pat. No. 4,956,774 to Shibamiya et al. discloses a method for
selecting an access path in a relational database management
system having at least one index. The first step is to select a
number of most frequently occurring values of at least part of a
key of the index. The number is greater than zero and less than
the total number of such values. Statistics on the frequency of
occurrence of the selected values are collected. An estimate of
the time required to use the index as the access path is made,
based at least in part on the index's most frequently occurring
values statistics. The estimate is used as the basis at least in
part for selecting an access path for the query. The database
optimizer described is hierarchically organized in order of word
frequency.

"Recent trends in hierarchic document clustering: A critical
review" by Peter Willett, Information Processing & Management,
Vol. 24, No. 5, pages 577-97 (1988, printed in Great Britain)
describes the calculation of interdocument similarities and
clustering methods that are appropriate for document clustering.
The article further discusses procedures that can be used to allow
the implementation of the aforementioned methods on databases of
nontrivial size. The validation of document hierarchies is
described using tests based on the theory of random graphs and on
empirical characteristics of document collections that are to be
clustered. A range of search strategies is available for retrieval
from document hierarchies and the results are presented in a
series of research projects that have used these strategies to
search a cluster resulting from several different types of
hierarchic agglomerative clustering methods. The article suggests
that a complete linkage method is probably the most effective
method in terms of retrieval performance; however, it is also
difficult to implement in an efficient manner. Other applications
of document clustering techniques are discussed briefly;
experimental evidence suggests that nearest neighbor clusters,
possibly represented as a network model, provide a reasonably
efficient and effective means of including interdocument
similarity information in document retrieval systems.

"Understanding Multi-Articled Documents" by Tsujimoto et al.,
presented in June 1990 in Atlantic City, N.J. at the 10th
International Conference for Pattern Recognition, describes an
attempt to build a method to understand document layouts without
the assistance of character recognition results, i.e., the meaning
of contents. It is shown that documents have an obvious
hierarchical structure in their geometry which is represented by a
tree. A small number of rules are introduced to transform the
geometric structure into the logical structure which represents
the semantics carried by the documents. A virtual field separator
technique is employed to utilize information carried by a special
constituent of documents such as field separators and frames,
keeping the number of transformation rules small.

SUMMARY OF THE INVENTION

In the basic iteration of the proposed Scatter-Gather browsing
tool, which can be defined as a method and apparatus for browsing
large document collections, the user is presented with the
descriptions (or summaries) of a fixed number of document groups.
Based on these summaries, the user selects one or more of these
groups for further study. These selected groups are gathered
together to form a subcollection. The system then scatters (or
reclusters) the new subcollection into the same fixed number of
document groups, which are again presented to the user. With each
successive iteration, since the number of documents decreases
through selec-
tion while the number of groups remains fixed, the groups become
smaller and therefore more detailed. Ultimately, when the groups
become small enough, this process bottoms out by viewing
individual documents. A history mechanism allows the user to back
up from the results of an iteration, to either try a different
selection or to back up further.

Scatter-Gather depends on the existence of two facilities. First,
since clustering and reclustering is an essential part of the
basic iteration, there is clearly a requirement for a procedure
which can cluster a large number of documents within a time
tolerable for user interaction (typically no more than about 60
seconds). For this strategy to scale with corpus size, some
provision must also be made for dealing with corpora larger than
can be clustered within this time allowance. Second, given a group
of documents, some method for automatically summarizing the group
must be specified. This cluster description must be sufficiently
revealing for a user to gain a sense of the topic defined by the
group, yet short enough for many descriptions to be appreciated
simultaneously.

Three procedures are disclosed which can be used together to
satisfy the above requirements, or used separately to enhance
other corpus partitioning and/or searching techniques. A first
procedure, Buckshot, is a very fast partition-type clustering
method suitable for an online reclustering preferable for
Scatter-Gather. The Buckshot procedure employs three
subprocedures. The first subprocedure is truncated group average
agglomerative clustering which merges disjoint document sets, or
groups, starting with individuals, until only a predetermined
number of groups remain. At each step the two groups whose merger
would produce the greatest average similarity are merged into a
single new group. The second subprocedure determines a trimmed sum
profile from selected documents closest to a document group
centroid. The third subprocedure assigns individual documents to
the closest center represented by one of the trimmed average
profiles.

A second procedure, Fractionation, is another more precise,
partition-type clustering method suitable for a static off-line
partitioning of the entire corpus which can be presented first to
the user. Fractionation can be thought of as a more careful
version of Buckshot that trades speed for increased accuracy. In
particular, the random sampling of a corpus of documents and the
partitioning of the random sample by truncated group average
agglomerative clustering (as provided in the Buckshot procedure),
are replaced by a deterministic center finding subprocedure (for
Fractionation). Fractionation also provides additional refinement
iterations and procedures. Specifically, Fractionation provides
initial partitioning of a corpus of documents using a
deterministic center finding subprocedure. Then, by applying an
Assign-to-Nearest subprocedure, a partition of a desired size is
determined. Finally, the partitioning is refined through iterative
refinement subprocedures.

A third procedure, Cluster Digest, is a cluster summarization
suitable as a cluster display method in Scatter-Gather. For
Scatter-Gather, it is desirable for a summary to be constant in
size, so a fixed number of summaries can be reliably fit onto a
given display area. The Cluster Digest summary lists a fixed
number of topical words plus the document titles of a few typical
documents, where topical words are those that often occur in a
cluster, and typical documents are those close to the cluster
centroid (or trimmed centroid).

BRIEF DESCRIPTION OF THE DRAWINGS

The invention will be described in detail with reference to the
following drawings in which like reference numerals refer to like
elements.

FIG. 1 is a block diagram of the hardware components used to
practice the present invention.

FIG. 2 is a high level flowchart of a preferred embodiment of the
Scatter-Gather document browsing method according to the present
invention.

FIG. 3 is an illustrative diagram of a preferred embodiment of the
Scatter-Gather document browsing method of the present invention
being applied to a corpus of documents consisting of Grolier's
Encyclopedia.

FIG. 4 is an intermediate level flowchart of a first embodiment of
a Fractionation initial partitioning method of the present
invention for preparing an initial ordering of a corpus of
documents as used in the Scatter-Gather document browsing method.

FIG. 5 is a high level flowchart, according to the present
invention, of an embodiment of a Cluster Digest method for
determining a summary of an ordering of a corpus of documents in
the Scatter-Gather document browsing method.

FIG. 6 is an intermediate level flowchart, according to the
present invention, of a preferred embodiment of a Buckshot method
for providing a further, preferably on-line, ordering of a corpus
of documents in the Scatter-Gather browsing method.

DETAILED DESCRIPTION OF THE
PREFERRED EMBODIMENT

A. Overview

The present invention can be implemented in a document corpus
browsing system 12 as illustrated by block diagram in FIG. 1. The
system includes a central processing unit (microprocessor) 1 for
receiving signals from, and outputting signals to various other
components of the system, according to one or more programs run on
microprocessor 1. The system includes a read only memory (ROM) 2
for storing operating programs. A random access memory (RAM) 3 is
provided for running the various operating programs, and
additional files 4 could be provided for overflow and the storage
of partitioned corpora used by the present invention in performing
a search operation.

Prior to performing a browsing procedure, a document corpus is
input from a corpus input 5. The corpus is then partitioned by
software controlling processor 1 according to the teachings of the
present invention. Monitor 8 is provided for displaying results of
partitioning procedures, and for permitting the user to interface
with the operating programs. A user input device 9 such as, for
example, a mouse, a keyboard, a touch screen or combinations
thereof is provided for input of commands by the operator. A
printer 10 can also be provided so that hard copies of documents,
as well as print-outs containing Cluster Digest summaries, can be
printed.

The system 12 is based in a digital computer which can implement
an off-line preparation of an initial ordering using, for example,
the Fractionation method disclosed herein. The system 12 also
determines a summary of the initial ordering of the corpus which
can be displayed to user via monitor 8 or printer 10 for user
interaction. This summary can be determined by, for example, using
the Cluster Digest method disclosed herein. After receiving
appropriate instructions from a user via user input device 9,
system 12 can perform a further ordering of the corpus using, for
example, the on-line Buckshot method described herein.

An illustration of Scatter-Gather in operation is shown in FIG. 3.
Here the text collection 20 is an on-line version of Grolier's
encyclopedia (roughly 64 Megabytes of ASCII text) with each of the
twenty-seven thousand articles treated as a separate document.
Suppose the user is interested in investigating the role of women
in the exploration of space. Rather than attempting to express
this information need as a formal query, the user instead selects
a number of the top-level clusters referenced as 22A-I that, from
their description, seem relevant to the topic of interest. In this
case, the user selects the clusters 22A, 22C and 22H labelled
"military history", "science and industry", and "American society"
to form a reduced corpus 24 of the indicated subset of articles
from Grolier's. (Note, the cluster labels are idealized in this
illustration; the actual implementation produces cluster
descriptions which are longer than would fit conveniently in this
figure. However, the given labels are reasonable glosses of topics

at a finer level of detail than the top-level clusters. The user
again selects clusters of interest. In this case, these include
clusters 26E, 26G and 26H labeled "aviation", "engineering", and
"physics". Again, a further reduced corpus 28 is formed and
reclustered. The final set of clusters 30A-F includes clusters
labeled for "military aviation", "Apollo program", "aerospace
industry", "weather", "astronomy" and "civil aviation". At this
stage the clusters are small enough for direct perusal via an
exhaustive list of article titles. Assuming at least one article
of interest is found, the user may find more articles of a similar
nature in the same cluster, or may use a directed search method,
based on the vocabulary of the found article or of the cluster
description, to find additional articles.

The Scatter-Gather browsing technique can be demonstrated in more
detail through the following sample session performed with a
prototype implementation. The basic steps are illustrated in FIG.
2, with the output data (output via a monitor or printer) being
illustrated below. The corpus is still Grolier's encyclopedia and
the goal is to find documents discussing space explorers. To
create the initial partition (Step 16), the Buckshot clustering
method is first applied. Fractionation could also be used for this
task, time permitting.



> (sctq first (time (outline (all-docs tdb)))) 

cluster 27581 items 

global cluster 525 items . . . 

cluster sizes 38 29 45 42 41 46 26 124 115 19 

assign to nearest . . . done 

cluster sizes 1838 1987 3318 2844 2578 2026 1568 4511 5730 1 181 
assign to nearest . . . done 

cluster sizes 1532 2206 3486 2459 2328 2059 1609' 4174 6440 1288 

Ouster (h Size: 1532. Focus articles: American music, music, history of; Ita 

focus terms: music, opus, composer, century, musical, play, dance, style, i 
Cluster 1: Size: 2206. Focus articles: Christianity; Germany, history o; church 

focus termsxhurch, king, roman, son, war, century, christian, emperor, jo 
Cluster 2: Size: 3486. Focus articles: French literature; English literature; 

focus termsmovel, trans, play, cng, writer, life, poet, american, poem, s 
Ouster 3: Size: 2459. Focus articles: education; university; graduate educate 

focus termsamtversity, study, school, state, american, theory, college, s 
Ouster 4: Size: 2328. Focus articles: plant; fossil record; mammal; growth; o 

focus tenns: water, year, cell, area, animal, body, disease, human, develop 
Ouster 5: Size: 2059. Focus articles- bird; shorebirds; flower; viburnum; cac 

focus termsspecie, fam ily, plant, flower, grow, genus, tree, leaf, white, 
Ouster 6: Size: 1609. Focus articles: radio astronomy; space exploration; sta 

focus terms right, star, earth, space, energy, surface, motion, line, fiel 
Ouster 7: Size 4174. Focus articles: Latin American art; art; American art a 

focus termsxity, century, art, plant, build, center, style, be museum, d 
Ouster 8: Size: 6440. Focus articles: United States, his; United States; Euro 

focus terms; state, war, unite, year, government, world, american, area, ri 
Ouster 9: Size 1288. Focus articles: chemistry, history; organic chemistry; 

focus tennsxhemical, element, compound, metal, numb, atom, water, process 
real time 465953 msec 



described by actual cluster summaries). 

The reduced corpus is then reclustered on the fly to 
produce a new set of clusters 26A-J covering the re- 
duced corpus 24. Since the reduced corpus contains a 
subset of the articles in Grolier's, these new clusters are 



Each cluster is described with a two line display, via 
an application of Cluster Digest (Step 17). Clusters 6 
(Astronomy), 8 (U.S. History) and 9 (Chemistry) are 
picked, as those which seem likely to contain articles of 
interest, Recluster (Step 18), and Display. 



> time (setq second (outline first 6 8 9») 

cluster 9337 items 

global cluster 305 items . . . 

cluster sizes 23 11 19 31 18 57 38 48 21 39 

assign to nearest . . . done 

cluster sizes 706 298 679 630 709 1992 980 1611 1 159 573 
a s s i gn to nearest . . . done 

cluster sizes 538 315 680 433 761 1888 1376 1566 1068 712 

Ouster 0: Size: 538. Focus articles: Liberal parties; political parties; Lab 

focus termsiparty, minister, government, prime, leader, war, state, politi 
Ouster 1: Size: 315. Focus articles: star; astronomy and astr; extragalactic 

focus terrasstar, sun, galaxy, year, earth, distance, light, astronomer, m 
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-continued 

Cluster 2: Size: 680. Focus articles: television; glass; aluminum; sound reco 

focus termsrprocess, metal, material, tight, type, mineral, color, device, 
Cluster 3: Sire 433. Focus articles: Laccadive Islands; French Southern an; 

focus terms: island, sq, area, population, south, west, state, coast, north 
Cluster 4: Size: 761. Focus articles: organic chemistry; chemistry, history; 

focus termsxhemical, element, compound, numb, atom, acid, reaction, water 
Cluster 5: Size: 1888. Focus articles: United States, his; Europe, history of; 

focus terms: war, state, world, unite, american, brittsh, army, government. 
Cluster 6: Size: 1376. Focus articles: president of the U; Democratic party; b 

focus termsxstate, president, law, court, year, unite, right, american, wa 
Cluster 7: Size: 1566. Focus articles: United States; North America; Australia 

focus termsstate, river, area, population, north, south, year, west, regi 
Cluster 8: Size: 1068. Focus articles: space exploration; radio astronomy; cle 

focus terms;energy, space, light, earth, particle, field, theory, motion. 
Cluster 9: Size: 712. Focus articles: corporation; monopoly and corape; govern 

focus terms:company, state, unite, product, bilKon, year, service, sale, 
real time 186338 msec 



Iteration occurs, this time selecting clusters 1 (Astronomy) and 8 (Space Exploration and Astronomy).

> (time (setq third (outline second 1 8)))
cluster 1383 items
global cluster 117 items . . .
cluster sizes 12 4 8 15 12 7 8 22 9 20
move to nearest . . . done
cluster sizes 172 63 70 236 76 131 75 198 124 238
move to nearest . . . done
cluster sizes 176 83 86 205 86 134 84 187 132 210
Cluster 0: Size: 176. Focus articles: thermodynamics; light; energy; optics;
    focus terms: energy, heat, light, temperature, gas, wave, motion, air, pres
Cluster 1: Size: 83. Focus articles: analytic geometry, line; geometry; coor
    focus terms: point, line, plane, angle, circle, geometry, coordinate, curve
Cluster 2: Size: 86. Focus articles: radio astronomy; observatory, astro; te
    focus terms: telescope, observatory, radio, instrument, astronomy, light, s
Cluster 3: Size: 205. Focus articles: solar system; Moon; planets; astronomy,
    focus terms: earth, sun, solar, planet, moon, satellite, orbit, surface, ye
Cluster 4: Size: 86. Focus articles: radio, magnetism; circuit, electric; ge
    focus terms: field, frequency, magnetic, electric, electrical, wave, circui
Cluster 5: Size: 134. Focus articles: nuclear physics; atomic nucleus; physic
    focus terms: particle, energy, electron, charge, nuclear, proton, radiation
Cluster 6: Size: 84. Focus articles: measurements; units, physical; electroma
    focus terms: unit, value, measure, numb, measurement, function, equal, obje
Cluster 7: Size: 187. Focus articles: space exploration; Space Shuttle; Soyuz
    focus terms: space, launch, flight, orbit, satellite, mission, rocket, eart
Cluster 8: Size: 132. Focus articles: physics, history o; Einstein, Albert; g
    focus terms: theory, physic, light, physicist, motion, einstein, law, parti
Cluster 9: Size: 210. Focus articles: star; extragalactic syst; astronomy and
    focus terms: star, galaxy, light, year, sun, distance, mass, cluster, brigh
real time 37146 msec

Space exploration is well separated from Astronomy in cluster 7, thus Space Exploration is picked by the operator.



> (time (setq fourth (outline third 7)))
cluster 187 items
global cluster 43 items . . .
cluster sizes 1 2 4 1 1 15 1 11 2 5
assign to nearest . . . done
cluster sizes 1 6 20 5 2 79 1 47 4 22
assign to nearest . . . done
cluster sizes 1 9 22 8 2 69 1 46 4 25
Cluster 0: Size: 1. Focus articles: Stealth
    focus terms: radar, bomber, aircraft, fly, stealth, shape, wing, replace, a
Cluster 1: Size: 9. Focus articles: Juno; von Braun, Wernher; Jupiter, soun
    focus terms: rocket, space, missile, research, jupiter, redstone, satellite
Cluster 2: Size: 22. Focus articles: rockets and missil; Atlas; Thor; Titan;
    focus terms: missile, rocket, launch, stage, kg, thrust, space, ballistic,
Cluster 3: Size: 8. Focus articles: helicopter; VTOL; flight; jet propulsio
    focus terms: flight, engine, air, aircraft, rotor, helicopter, lift, speed,
Cluster 4: Size: 2. Focus articles: STOL; C-5 Galaxy;
    focus terms: aircraft, wing, speed, lift, engine, air, takeoff, land, weigh
Cluster 5: Size: 69. Focus articles: space exploration; Soyuz; Salyut; Volko
    focus terms: space, launch, soyuz, cosmonaut, soviet, flight, spacecraft, m
Cluster 6: Size: 1. Focus articles: railgun;
    focus terms: projectile, sec, accelerate, speed, space, test, launch, field
Cluster 7: Size: 46. Focus articles: Gordon, Richard F.; Stafford, Thomas P;
    focus terms: astronaut, apollo, pilot, space, lunar, mission, flight, moon,
Cluster 8: Size: 4. Focus articles: phobia; claustrophobia; agoraphobia; He
    focus terms: space, fear, phobia, claustrophobia, canvas, person, agoraphob
Cluster 9: Size: 25. Focus articles: communications sat; GEOS; Vanguard; SYN
    focus terms: satellite, launch, orbit, space, earth, kg, communication, pro
real time 18221 msec



Two relevant, yet distinct, clusters emerge at this stage, namely 5 (Soviet Space Exploration) and 7 (U.S. Space Exploration). The contents of these clusters are exhaustively displayed as follows:



> (print-titles (nth 5 fourth))
   40  Zond
   74  Zholobov, Vitaly
  238  Yeliseyev, Aleksei Stanislavovich
  239  Yegorov, Boris B.
  921  weightlessness
 1269  Vostok
 1270  Voskhod
 1286  Volynov, Boris Valentinovich
 1306  Volkov, Vladislav Nikolayevich
 1574  Venera
 2345  tracking station
 2522  Titov, Gherman S.
 2881  Tereshkova, Valentina
 3959  Sputnik
 4120  Spacelab
 4125  space station
 4126  Space Shuttle
 4127  space medicine
 4128  space law
 4129  space exploration
 4131  Soyuz
 4365  SNAP
 4477  Skylab
 4849  Shatalov, Vladimir A.
 4943  Sevastianov, Vitaly I.
 5465  Sarafanov, Gennady Vasilievich
 5611  Salyut
 5809  Ryumin, Valery
 5893  Rukavishnikov, Nikolay
 5928  Rozhdestvensky, Valery
 6074  Romanenko, Yury Viktorovich
 6457  Remek, Vladimir
 6652  Ranger
 7381  Popovich, Pavel Romanovich
 8267  Patsayev, Viktor
 9751  National Aeronautics and Space Administration
10915  Mercury program
11104  McNair, Ronald
11437  Mars
11729  Makarov, Oleg
12006  Lyakhov, Vladimir
12042  Lunokhod
12054  Luna (spacecraft)
12616  Leonov, Aleksei
12723  Lebedev, Valentin
12766  Lazarev, V. G.
13145  Kubasov, Valery N.
13243  Komarov, Vladimir
13296  Klimuk, P. I.
13427  Khrunov, Yevgeny
13513  Kennedy Space Center
13626  Kapustin Yar (ka-poas-tin yahr)
14072  Jarvis, Gregory
14224  Ivanov, Georgy
14226  Ivanchenkov, Aleksandr
16208  Gubarev, Aleksei
16439  Grechko, Georgy Mikhailovich
16665  Gorbatko, Viktor V.
16796  Goddard Space Flight Center
16864  Glazkov, Yury Nikolayevich
18383  Feoktistov, Konstantin P.
19449  Dzhanibekov, Vladimir
19906  Dobrovolsky, Georgy T.
20266  Demin, Lev Stepanovich
23340  Bykovsky, Valery
25920  astronautics
26021  Artyukhin, Y. P.
26313  Apollo-Soyuz Test Project
27103  Aksenov, Vladimir

> (print-titles (nth 7 fourth))
  111  Young, John W.
  403  [illegible]
  753  White, Edward H., II
  903  Weitz, Paul J.
 3391  Surveyor
 3910  Stafford, Thomas P.
 4460  [illegible]
 4805  Shepard, Alan B., Jr.
 5173  Scott, David R.
 5211  Schweickart, Russell L.
 5289  Schirra, Walter M., Jr.
 6047  Roosa, Stuart A.
 7519  Pogue, William Reid
10526  Mitchell, Edgar D.
11139  McDivitt, James
11245  Mattingly, Thomas
12050  Lunar Rover
12051  Lunar Orbiter
12052  Lunar Excursion Module
12148  Lousma, Jack
13897  Johnson Space Center
14297  Irwin, James
16026  Haise, Fred W.
16299  Grissom, Virgil I.
16651  Gordon, Richard F., Jr.
16998  Gibson, Edward
17189  Gemini program
17282  Garriott, Owen
18697  Evans, Ronald
19236  Eisele, Donn F.
19579  Duke, Charles, Jr.
20916  Crippen, Robert
21243  Cooper, Leroy Gordon, Jr.
21348  Conrad, Pete
21593  Collins, Michael
22479  Chaffee, Roger
22523  Cernan, Eugene
22826  Carr, Gerald
22835  Carpenter, Scott
23924  Brand, Vance
25140  Bean, Alan
25921  astronaut
26102  Armstrong, Neil A.
26314  Apollo program
26567  Anders, William Alison
26998  Aldrin, Edwin E.



Two sets of relevant documents, with relatively disjoint vocabularies, have thus been discovered. At this stage individual documents may be examined, or some directed search tool might be applied to this restricted corpus. This example illustrates that the steps of determining a summary (with Cluster Digest) and providing a further ordering (with, for example, Buckshot) can be performed multiple times.
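
The control flow of the session above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `cluster` stands in for Buckshot or Fractionation, `choose` stands in for the user reading the cluster summaries and picking clusters, and `small_enough` is an arbitrary cutoff for listing titles directly. None of these names come from the specification.

```python
from typing import Callable, List, Sequence

Docs = List[int]  # document identifiers (an assumption of this sketch)


def scatter_gather(
    corpus: Docs,
    cluster: Callable[[Docs, int], List[Docs]],
    choose: Callable[[List[Docs]], Sequence[int]],
    k: int = 10,
    small_enough: int = 100,
) -> List[Docs]:
    """Alternate scatter (cluster) and gather (pool the chosen clusters).

    `cluster` and `choose` are hypothetical hooks supplied by the caller.
    """
    groups = cluster(corpus, k)                  # initial partition (Step 16)
    while True:
        picked = choose(groups)                  # user inspects summaries (Step 17)
        if not picked:
            return groups                        # nothing selected: stop browsing
        gathered = [doc for i in picked for doc in groups[i]]
        if len(gathered) <= small_enough:
            return [groups[i] for i in picked]   # small enough to list titles
        groups = cluster(gathered, k)            # recluster reduced corpus (Step 18)
```

The loop shrinks the corpus only if `choose` selects a proper subset of the clusters; a caller that always selects every cluster would never terminate.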
B. Procedures

For each document, $\alpha$, in a collection (or corpus), $C$, let a countfile $c(\alpha)$ be a set of words, with their frequencies, that occur in that document. Let $V$ be a set of unique words occurring in $C$. Then $c(\alpha)$ can be represented as a vector of length $|V|$:

$$c(\alpha)_i = \begin{cases} 0 & \text{if } w_i \notin \alpha \\ f(w_i, \alpha) & \text{if } w_i \in \alpha \end{cases}$$

for $1 \le i \le |V|$, where $w_i$ is the $i$th word in $V$ and $f(w_i, \alpha)$ is the frequency of $w_i$ in $\alpha$.

To measure the similarity between pairs of documents, $\alpha$ and $\beta$, let the cosine be employed between monotone element-wise functions of $c(\alpha)$ and $c(\beta)$. In particular, let

$$s(\alpha, \beta) = \frac{\langle g(c(\alpha)),\, g(c(\beta)) \rangle}{\|g(c(\alpha))\| \, \|g(c(\beta))\|}$$

where $g$ is a monotone damping function and "$\langle\,\rangle$" denotes inner product. $g(x) = \sqrt{x}$ produces good results; hence, $g(x) = \sqrt{x}$ is used in the current implementation. However, it is understood that other values of $g(x)$ could be used.

It is useful to consider similarity to be a function of document profiles, where

$$p(\alpha) = \frac{g(c(\alpha))}{\|g(c(\alpha))\|},$$

in which case

$$s(\alpha, \beta) = \langle p(\alpha), p(\beta) \rangle = \sum_{i=1}^{|V|} p(\alpha)_i \, p(\beta)_i.$$

Suppose $\Gamma$ is a set of documents, or a document group. A profile can be associated with $\Gamma$ by defining it to be a normalized sum profile of contained individuals. Let

$$\hat{p}(\Gamma) = \sum_{\alpha \in \Gamma} p(\alpha)$$

define an unnormalized sum profile, and

$$p(\Gamma) = \frac{\hat{p}(\Gamma)}{\|\hat{p}(\Gamma)\|}$$

define a normalized sum profile. Similarly, the cosine measure can be extended to $\Gamma$ by employing this profile definition:

$$s(\Gamma, x) = \langle p(\Gamma), p(x) \rangle.$$

B.1. Partitional Clustering

Partitional clustering presumes a parameter $k$ which is the desired size of the resulting partition (number of subgroups). The general strategy is to: (1) find $k$ centers, or seeds; (2) assign each document to one of these centers (for each $\alpha$ in $C$ assign $\alpha$ to the nearest center); and (3) possibly refine the partition, either through iteration or some other procedure. The result is a set $P$ of $k$ disjoint document groups such that $\bigcup_{\Pi \in P} \Pi = C$.

The Buckshot method uses random sampling followed by truncated agglomerative clustering to find the $k$ centers. Its refinement procedure is simply an iteration of assigning each document to a $k$ center where new centers are determined from a previous partition $P$.

The Fractionation method uses successive stages of truncated agglomerative clustering over fixed sized groups to find the $k$ centers. Refinement is achieved through repeated application of procedures that attempt to split, join and clarify elements of the partition $P$.

The Buckshot method sacrifices some precision (in the sense of document misclassification) in favor of speed, while Fractionation attempts to find a very high precision partition through exhaustive refinement. Buckshot is appropriate for the on-the-fly online reclustering required by inner iterations of Scatter-Gather, while Fractionation can be used, for example, to establish the primary partitioning of the entire corpus, which is displayed in the first iteration of Scatter-Gather.

B.2. Buckshot

The Buckshot method employs three subprocedures. The first subprocedure, truncated group average agglomerative clustering, merges disjoint document sets, or groups, starting with individuals until only $k$ groups remain. At each step the two groups whose merger would produce the least decrease in average similarity are merged into a single new group. The second subprocedure determines a trimmed sum profile from selected documents closest to a document group centroid. The third subprocedure assigns individual documents to the closest center represented by one of these trimmed sum profiles.

For truncated group average agglomeration, let $\Gamma$ be a document group; then the average similarity between any two documents in $\Gamma$ is defined to be:

$$S(\Gamma) = \frac{1}{|\Gamma|(|\Gamma| - 1)} \sum_{\alpha \in \Gamma} \; \sum_{\beta \in \Gamma,\, \beta \neq \alpha} s(\alpha, \beta).$$

Let $G$ be a set of disjoint document groups. The basic iteration of group average agglomerative clustering finds the pair $\Gamma'$ and $\Delta'$ such that

$$S(\Gamma' \cup \Delta') = \max_{\Gamma \in G} \; \max_{\Delta \neq \Gamma} S(\Gamma \cup \Delta).$$

A new, smaller, partition $G'$ is then constructed by merging $\Gamma'$ with $\Delta'$,

$$G' = (G - \{\Gamma', \Delta'\}) \cup \{\Gamma' \cup \Delta'\}.$$

Initially, $G$ is simply a set of singleton groups, one for each individual to be clustered. The iteration terminates (or truncates) when $|G'| = k$. Note that the output from this procedure is the final flat partition $G'$, rather than a nested hierarchy of partitions, although the latter could be determined by recording each pairwise join as one level in a dendrogram.

If the cosine similarity measure is employed, the inner maximization can be significantly accelerated. Recall that $\hat{p}(\Gamma)$ is the unnormalized sum profile associated with $\Gamma$. Then the average pairwise similarity, $S(\Gamma)$, is
simply related to the inner product, $\langle \hat{p}(\Gamma), \hat{p}(\Gamma) \rangle$. That is, since:

$$\begin{aligned}
\langle \hat{p}(\Gamma), \hat{p}(\Gamma) \rangle &= \sum_{\alpha \in \Gamma} \sum_{\beta \in \Gamma} \langle p(\alpha), p(\beta) \rangle \\
&= |\Gamma|(|\Gamma| - 1)\, S(\Gamma) + \sum_{\alpha \in \Gamma} \langle p(\alpha), p(\alpha) \rangle \\
&= |\Gamma|(|\Gamma| - 1)\, S(\Gamma) + |\Gamma|,
\end{aligned}$$

it follows that

$$S(\Gamma) = \frac{\langle \hat{p}(\Gamma), \hat{p}(\Gamma) \rangle - |\Gamma|}{|\Gamma|(|\Gamma| - 1)}.$$

Similarly, for the union of two disjoint groups, $\Lambda = \Gamma \cup \Delta$,

$$S(\Lambda) = \frac{\langle \hat{p}(\Lambda), \hat{p}(\Lambda) \rangle - (|\Gamma| + |\Delta|)}{(|\Gamma| + |\Delta|)(|\Gamma| + |\Delta| - 1)}$$

where,

$$\langle \hat{p}(\Lambda), \hat{p}(\Lambda) \rangle = \langle \hat{p}(\Gamma), \hat{p}(\Gamma) \rangle + 2 \langle \hat{p}(\Gamma), \hat{p}(\Delta) \rangle + \langle \hat{p}(\Delta), \hat{p}(\Delta) \rangle.$$

Therefore, if for every $\Gamma \in G$, $S(\Gamma)$ and $\hat{p}(\Gamma)$ are known, the pairwise merge that will produce the greatest average similarity can be cheaply determined by determining one inner product for each candidate pair. Further, suppose for every $\Gamma \in G$ the $\Delta$ were known such that:

$$S(\Gamma \cup \Delta) = \max_{\Delta \neq \Gamma} S(\Gamma \cup \Delta),$$

then finding the best pair would simply involve scanning the $|G|$ candidates. Updating these quantities with each iteration is straightforward, since only those involving $\Gamma'$ and $\Delta'$ need be redetermined.

Using techniques such as those described above, it can be seen that the average time complexity for truncated group average agglomerative clustering is $O(N^2)$, i.e., proportional to $N^2$, where $N$ is equal to the number of individuals to be clustered.

B.2.a. Trimmed Sum Profiles

For trimmed sum profiles, given a set of $k$ document groups that are to be treated as $k$ centers for the purpose of attracting other documents, it is necessary to define a centroid, or attractor, for each group $\Gamma$. One simple definition would just consider the normalized sum profile for each group, $p(\Gamma)$. However, better results are achieved by trimming out documents far from this centroid.

For every $\alpha$ in $\Gamma$ let $r(\alpha, \Gamma)$ be the rank of $\langle p(\alpha), p(\Gamma) \rangle$ among the similarities $\langle p(\beta), p(\Gamma) \rangle$ for $\beta \in \Gamma$, with rank 1 assigned to the most similar document. Then define:

$$\hat{p}_m(\Gamma) = \sum_{\alpha \in \Gamma:\; r(\alpha, \Gamma) \le m} p(\alpha), \qquad p_m(\Gamma) = \frac{\hat{p}_m(\Gamma)}{\|\hat{p}_m(\Gamma)\|}.$$

This determination can be carried out in time proportional to $|\Gamma| \log |\Gamma|$.

Essentially, $p_m(\Gamma)$ is the normalized sum of the $m$ nearest neighbors in $\Gamma$ to $p(\Gamma)$. The trimming parameter $m$ may be defined adaptively as some percentage of $|\Gamma|$, or may be fixed. For example, in one implementation $m = 20$.

The trimming of far away documents in determining the centroid profile leads to better focussed centers, and hence to more accurate assignment of individual documents in Assign-to-Nearest.

B.2.b. Assign to Nearest

For assign-to-nearest, once $k$ centers have been found, and suitable profiles defined for those centers, each document in $C$ must be assigned to one of those centers based on some criterion. The simplest criterion assigns each document to the closest center.

Let $G$ be a partition of $k$ groups, and let $\Gamma_i$ be the $i$th group in $G$. Let:

$$\Pi_i = \left\{ \alpha \in C : \langle p(\alpha), p_m(\Gamma_i) \rangle = \max_{j} \langle p(\alpha), p_m(\Gamma_j) \rangle \right\}.$$

Ties can be broken by assigning $\alpha$ to the group with lowest index. $P = \{\Pi_i\}$, $1 \le i \le k$, is then the desired assign-to-nearest partition.

$P$ can be efficiently determined by constructing an inverted map for the $k$ centers $p_m(\Gamma_i)$, and for each $\alpha \in C$ simultaneously determining the similarity to all the centers. In any case, the cost of this procedure is proportional to $kN$, where $N = |C|$.

B.2.c. Buckshot Procedure

FIG. 6 is a flowchart of a preferred embodiment of a Buckshot method for providing a further ordering of a corpus of documents in the Scatter-Gather method as shown in step 18 of FIG. 2. The Buckshot fast clustering method takes as input a corpus, $C$, an integer parameter $k$, $0 < k \le |C|$, and produces a partition of $C$ into $k$ groups. Let $N = |C|$. The steps of the Buckshot method include:

1. Construct a random sample $C'$ from $C$ of size $\sqrt{kN}$, sampling without replacement (step 130).
2. Partition $C'$ into $k$ groups by truncated group average agglomerative clustering (step 132). Call this partition $G$.
3. Construct a partition $P$ of $C$ by assigning each individual $\alpha$ to one of the centers in $G$ (step 134). This is accomplished by applying assign-to-nearest over the corpus $C$ and the $k$ centers $G$.
4. Replace $G$ with $P$ (step 136) and repeat steps 3 and 4 once.
5. Return the new corpus partition $P$ (step 138).

Since random sampling is employed in the definition of $C'$, repeated calls to this procedure on the same corpus may well produce different partitions, although repeated calls generally produce qualitatively similar partitions. It is easy to see that, if $C$ is actually composed of $k$ well separated clusters, then as $N$ increases, one is essentially assured of sampling at least one candidate from each cluster. Hence, asymptotically Buckshot will reliably find these $k$ clusters.

Step 1 could be replaced by a more careful deterministic procedure, but at additional cost. In the contemplated application, Scatter-Gather, it is more important that the partition $P$ be determined at high speed than that the partitioning procedure be deterministic. Indeed, the lack of determinism might be interpreted as a feature, since the user then has the option of discarding an unrevealing partition in favor of a fresh reclustering.

Step 4 is a refinement procedure which iterates assign-to-nearest once. Further iterations are of course possible, but with diminishing return. Since minimizing determination time is a high priority, one iteration of assign-to-nearest is a good compromise between accuracy and speed.

Steps 2, 3 and 4 all have time complexity proportional to $kN$. Hence the overall complexity of Buckshot is $O(kN)$.
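
The five Buckshot steps can be made concrete with a small Python sketch. It is an illustration under assumptions, not the implementation described in the specification: documents are given as word lists, profiles are sparse dictionaries, and all function names are hypothetical. The agglomeration here rescans every pair on each merge, so it is slower than the accelerated $O(N^2)$ procedure the text describes; the merge criterion (largest average pairwise similarity of the union, computed from the sum profile) is the same.

```python
import math
import random
from collections import Counter
from typing import Dict, List, Tuple

Vec = Dict[str, float]


def profile(words: List[str]) -> Vec:
    """p(a): square-root-damped term counts, scaled to unit length."""
    damped = {w: math.sqrt(c) for w, c in Counter(words).items()}
    norm = math.sqrt(sum(v * v for v in damped.values())) or 1.0
    return {w: v / norm for w, v in damped.items()}


def dot(u: Vec, v: Vec) -> float:
    return sum(x * v.get(w, 0.0) for w, x in u.items())


def add(u: Vec, v: Vec) -> Vec:
    out = dict(u)
    for w, x in v.items():
        out[w] = out.get(w, 0.0) + x
    return out


def agglomerate(profiles: List[Vec], ids: List[int], k: int) -> List[List[int]]:
    """Truncated group-average agglomeration: merge until k groups remain."""
    groups: List[Tuple[List[int], Vec]] = [([i], profiles[i]) for i in ids]
    while len(groups) > k:
        best = None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                total = add(groups[a][1], groups[b][1])
                n = len(groups[a][0]) + len(groups[b][0])
                # S(union) = (<sum, sum> - n) / (n (n - 1)) for unit profiles
                score = (dot(total, total) - n) / (n * (n - 1))
                if best is None or score > best[0]:
                    best = (score, a, b, total)
        _, a, b, total = best
        merged = (groups[a][0] + groups[b][0], total)
        groups = [g for i, g in enumerate(groups) if i not in (a, b)] + [merged]
    return [members for members, _ in groups]


def center(profiles: List[Vec], members: List[int], m: int) -> Vec:
    """Trimmed sum profile: unit-length sum of the m members nearest the centroid."""
    total: Vec = {}
    for i in members:
        total = add(total, profiles[i])
    nearest = sorted(members, key=lambda i: dot(profiles[i], total), reverse=True)[:m]
    trimmed: Vec = {}
    for i in nearest:
        trimmed = add(trimmed, profiles[i])
    norm = math.sqrt(dot(trimmed, trimmed)) or 1.0
    return {w: x / norm for w, x in trimmed.items()}


def assign_to_nearest(profiles: List[Vec], centers: List[Vec]) -> List[List[int]]:
    parts: List[List[int]] = [[] for _ in centers]
    for i, p in enumerate(profiles):
        sims = [dot(p, c) for c in centers]
        parts[sims.index(max(sims))].append(i)  # index() breaks ties toward the lowest index
    return parts


def buckshot(docs: List[List[str]], k: int, m: int = 20, seed: int = 0) -> List[List[int]]:
    profiles = [profile(d) for d in docs]
    n = len(docs)
    size = min(n, max(k, math.ceil(math.sqrt(k * n))))
    sample = random.Random(seed).sample(range(n), size)   # step 1: random sample
    groups = agglomerate(profiles, sample, k)             # step 2: cluster the sample
    for _ in range(2):                                    # steps 3-4: assign, then repeat once
        centers = [center(profiles, g, m) for g in groups if g]
        groups = assign_to_nearest(profiles, centers)
    return groups                                         # step 5
```

Empty groups are dropped before centers are recomputed, so this sketch can return fewer than `k` groups when the sample misses a cluster; a different `seed` then gives a fresh reclustering.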

B.3 Cluster Digest

FIG. 5 is a flowchart of a preferred embodiment of a Cluster Digest method for determining a summary of an initial ordering of a corpus of documents in the Scatter-Gather document partitioning method as shown in step 17 of FIG. 2. For Cluster Digest, given a partition $P$, it is useful for diagnostic and display purposes to have some procedure for summarizing the individual document groups, or clusters, contained in $P$. One simple summary is simply an exhaustive list of titles, one for each document in $\Gamma$. However, such a summary grows linearly with group size and hence, is of limited use for large clusters. For Scatter-Gather, it is desirable for the summary to be constant size, so a fixed number of summaries can be reliably fit onto a given display area. Thus, in step 120, a summary of constant size is provided, and in step 122, a fixed number of "topical" words plus the document titles of a few "typical" documents are listed. Here "topical" words are those that occur often in the cluster and "typical" documents are those close to the cluster centroid (or trimmed centroid). This summary is referred to as a Cluster Digest.

Cluster Digest takes as input a document group $\Gamma$, and two parameters, $\omega$, which determines the number of topical words output, and $d$, which determines the number of typical documents found. The list of topical words is generated by sorting $p(\Gamma)$ (or $p_m(\Gamma)$) by term weight and selecting the $\omega$ highest weighted terms. The list of typical documents is generated by selecting the $d$ documents closest to $p(\Gamma)$.
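
A Cluster Digest along these lines might be sketched in Python as follows. The function and parameter names are hypothetical, documents are assumed to be word lists with parallel titles, and the untrimmed sum profile is used for both rankings. The defaults of 10 and 3 mirror the example values the text gives for these two parameters.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def profile(words: List[str]) -> Dict[str, float]:
    """Square-root-damped term counts, scaled to unit length."""
    damped = {w: math.sqrt(c) for w, c in Counter(words).items()}
    norm = math.sqrt(sum(v * v for v in damped.values())) or 1.0
    return {w: v / norm for w, v in damped.items()}


def cluster_digest(
    titles: List[str], docs: List[List[str]], omega: int = 10, d: int = 3
) -> Tuple[List[str], List[str]]:
    """Return (topical words, titles of typical documents) for one cluster."""
    profiles = [profile(doc) for doc in docs]
    total: Dict[str, float] = {}
    for p in profiles:
        for w, x in p.items():
            total[w] = total.get(w, 0.0) + x

    # Topical words: the omega highest-weighted terms (ties broken alphabetically).
    topical = sorted(total, key=lambda w: (-total[w], w))[:omega]

    # Typical documents: the d documents most similar to the sum profile.
    # Ranking against the unnormalized sum gives the same order as the cosine.
    def closeness(i: int) -> float:
        return sum(x * total.get(w, 0.0) for w, x in profiles[i].items())

    typical = sorted(range(len(docs)), key=closeness, reverse=True)[:d]
    return topical, [titles[i] for i in typical]
```

Because both lists have fixed length, the digest occupies the same display space whatever the cluster size.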



In one example, $\omega = 10$ and $d = 3$.

B.4 Fractionation

B.4.a. Overview and Supporting Material

Fractionation involves the development of an efficient and accurate document partitioning procedure which provides a number of desirable properties. The Fractionation procedure partitions a corpus into a predetermined number of "document groups". It operates in time proportional to the number of documents in the target corpus, thus distinguishing it from conventional hierarchical clustering techniques, which are intrinsically quadratic time. Although non-hierarchical in nature by producing a partition of the corpus rather than a tree, the Fractionation method can be applied recursively to generate a hierarchical structure if desired. New documents can be incorporated in time proportional to the number of buckets in an existing partition, providing desirable utility for dynamically changing corpora. Also, the resulting partitions are qualitatively interesting, that is, with few exceptions buckets can be interpreted as topic coherent. Finally, the similarity measure at the base of the procedure depends only on term frequency considerations, and hence should be applicable over all natural languages that contain a lexical analysis capability.

First, a document corpus is divided into a number of buckets, each of a given fixed size. This initial division can be done at random or by making use of document similarity based on word overlap to induce an ordering over the corpus that places together documents likely to be similar. These fixed size buckets are individually clustered using a standard agglomerative procedure. However, a stopping rule removes agglomerations from consideration once they achieve a given size, for example, 5. Agglomerative clustering terminates when the ratio of current output agglomerations to original inputs falls below a given threshold, the reduction factor currently being, for example, 0.25. The outputs are collected and rearranged into fewer new buckets of the same fixed size where groups of documents, rather than individuals, are counted as employed in the initial division, and the output collection and rearranging process is reiterated. This stage of the Fractionation procedure completes when the total number of outputs from a given iteration is close to the desired number of partitions.

Although the basic building block of the procedure is agglomerative clustering, it is always applied to groups of a given fixed size. Each iteration can be thought of as producing, in a bottom-up fashion, one level of an n-ary branching tree where the branching factor is the reciprocal of the given reduction factor. Hence, agglomerative clustering is applied as many times as there are internal nodes of a tree. Since this number is proportional to the number of terminal nodes, the cost of the procedure is proportional to the cost of each agglomerative procedure times the number of terminal nodes. Thus, if $n$ is the given fixed size,

$$(n^2)(N/n) = nN.$$

Therefore, this stage of the Fractionation procedure is linear in $N$ for a fixed desired number of partitions.
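
The bucketed stage described above can be sketched in Python as follows. This is one reading of the text rather than the specification's implementation: the names are hypothetical, the bucket size of 20 is an arbitrary assumption (the stop size 5 and reduction factor 0.25 are the example values from the text), buckets are formed in corpus order, and the stop rule counts the inputs of the current stage, each earlier output being treated as one individual.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Vec = Dict[str, float]
Group = Tuple[List[int], Vec, int]  # (document ids, sum profile, inputs absorbed this stage)


def profile(words: List[str]) -> Vec:
    """Square-root-damped term counts, scaled to unit length."""
    damped = {w: math.sqrt(c) for w, c in Counter(words).items()}
    norm = math.sqrt(sum(v * v for v in damped.values())) or 1.0
    return {w: v / norm for w, v in damped.items()}


def merged_score(a: Group, b: Group) -> Tuple[float, Vec]:
    """Average pairwise similarity of the union, computed from its sum profile."""
    total = dict(a[1])
    for w, x in b[1].items():
        total[w] = total.get(w, 0.0) + x
    n = len(a[0]) + len(b[0])
    return (sum(x * x for x in total.values()) - n) / (n * (n - 1)), total


def agglomerate_bucket(bucket: List[Group], stop: int, reduction: float) -> List[Group]:
    """Merge inside one bucket until about reduction * len(bucket) groups remain."""
    target = max(1, math.ceil(len(bucket) * reduction))
    active: List[Group] = [(m, p, 1) for m, p, _ in bucket]  # each input counts as one
    frozen: List[Group] = []
    while len(active) > 1 and len(active) + len(frozen) > target:
        best = None
        for i in range(len(active)):
            for j in range(i + 1, len(active)):
                score, total = merged_score(active[i], active[j])
                if best is None or score > best[0]:
                    best = (score, i, j, total)
        _, i, j, total = best
        new = (active[i][0] + active[j][0], total, active[i][2] + active[j][2])
        active = [g for t, g in enumerate(active) if t not in (i, j)]
        # Stop rule: a group that has absorbed `stop` inputs leaves consideration.
        (frozen if new[2] >= stop else active).append(new)
    return frozen + active


def fractionation_centers(
    docs: List[List[str]], k: int, bucket_size: int = 20, stop: int = 5, reduction: float = 0.25
) -> List[List[int]]:
    groups: List[Group] = [([i], profile(d), 1) for i, d in enumerate(docs)]
    while len(groups) > k:
        buckets = [groups[s:s + bucket_size] for s in range(0, len(groups), bucket_size)]
        reduced = [g for b in buckets for g in agglomerate_bucket(b, stop, reduction)]
        if len(reduced) == len(groups):  # no progress (e.g. reduction >= 1): give up
            break
        groups = reduced
    return [members for members, _, _ in groups]
```

Each stage works on buckets of bounded size, which is what keeps the cost of a stage linear in the number of inputs; the final count is only close to `k`, matching the text's "close to the desired number of partitions".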

In the agglomerative process, not every pairwise distance between agglomerations is considered; instead, reasonably well behaved subcorpora are separately clustered with the expectation that the seeds of global clusters will be formed within each subcorpus. The scope of the determination expands as it proceeds upwards, so that strong trends may be reinforced and weak ones subsumed. The final result is a partition of the original corpus into buckets where the content of each bucket is presumed to reflect the global trend. These buckets are expected to be noisy, in the sense that they may overlap in content, represent excessive merges, or contain documents better placed elsewhere. To repair these deficiencies, the initial partitioning is refined in iterative fashion.

A number of different refinement procedures can be devised which would improve the initial partitioning. Refinement methods which do not increase the linear order of the overall procedure can be accomplished by analysis of nearest neighbor patterns between individual documents of a given partitioning. It is desirable for the refinement efforts to be cumulative in the sense that they may be applied sequentially and in combination. Refinement methods can be used to merge, split and disperse buckets. Furthermore, application of a predetermined number of refinement methods does not change the linear order of the overall document procedure.

In contrast to standard hierarchical techniques, Fractionation can be easily adapted to incrementally update and dynamically change the partition of a given corpus. One simple strategy is to add new documents to the closest matching bucket of an existing partition. Refinement processes would run at regular intervals which could introduce new buckets if necessary by splitting old ones or merging buckets as distinctions become less clear. Also, although the resulting partition is not a hierarchical structure, it can be made so if desired, by recursive application of Fractionation to the subcorpora defined by single buckets.

Some particularly distinctive characteristics of Fractionation, as compared to other document partitioning procedures, include:

(1) Intermediate score. Rather than using word occurrence frequencies as a basis for document similarity, an intermediate score between the document and presence or absence in representation of document profiles is used. In particular, good performance is provided using square roots of counts within documents when summing the square roots across agglomerations. This allows putting about five documents or agglomerations together in a single step.

(2) Stepwise assembly. Preliminary buckets greater than or equal to five documents are not combined with single documents. Rather, stages of combining documents by five are iterated until the desired number of final buckets has been reached.

(3) Recurrent trimming. Documents which are not included in the initial preliminary bucketing into groups of greater than or equal to five documents are excluded from further procedures until the final iteration. At that time, the documents are assigned to the closest final bucket.

(4) Repeated improvement. Improvements are implemented through procedures such as repeated alternations of split and merge operations.

(5) No reliance on natural boundaries. Documents are partitioned according to internal coherence and reason-

Since the basic tool for agglomeration is group average agglomerative clustering, it must be applied in a way that does not consider all pairwise similarities, or even a number proportional to all pairwise similarities. For a constant $n$ which is similar to the desired size of the final partitioning, a corpus of $N$ can be divided into groups of size $n$. Truncated agglomerative clustering may be applied to each group in time proportional to $n^2$. Since there are about $N/n$ groups, the total time for this step is proportional to $(N/n)\,n^2 = Nn$.

Simple agglomerative clustering may be elaborated by supplying alternative stopping rules. These come in two types: conditions which limit the growth of any one agglomeration and conditions which terminate the overall process before a single over-arching agglomeration is formed. A condition of the first type states that if an agglomeration grows to size $k$, then it is removed from consideration in any further agglomeration steps. A condition of the second type terminates the iteration if the total set of agglomerations currently under consideration represents no more than $r\%$ of the size of the original corpus. Let $k$ be the "stop" parameter and $r$ be the "reduction" parameter.

Suppose two constants $b$ and $k$ are chosen, and agglomerative clustering is applied with reduction $1/b$. That is, agglomeration stops as soon as fewer than $n/b$ objects remain. After one step, the $N/b$ outputs can be treated as individuals and iterated. This will take $nN/b$ time. Thus the total time involving agglomerative clustering will be

$$nN\left(1 + \frac{1}{b} + \frac{1}{b^2} + \cdots\right)$$

which is rectangular in $n$ and $N$ (for $b > 1$ the series sums to at most $b/(b-1)$). This process will be described as a center finding procedure, since it has frequently been used with $k = 5$ and $r = 0.25$, and may be viewed as a bottom-up construction of a $1/r$ branching tree.

The refinement procedures, to be discussed herein as well as the assign-to-nearest procedure and the multiple use of agglomerative clustering are all rectangular in the size of the number of individuals and either the
able distinctiveness, rather than by any comparison number of buckets in the partitioning, or the size of the 
with predetermined boundaries. 45 initial groups, n, which was chosen to approximate the 

nit T*t_ i? , size of the final partitioning. 

B.4.b. The Fractionation Procedure FIG. 4 is a flowchart of one preferred embodiment of 

The overall procedure for Fractionation when start- the Fractionation Method described herein which can 
-ing with a new corpus, falls naturaUy into three stages,. _J«jemployed asjhe step of initializing a partition of a 
shown in FIG. 4: 50 corpus of documents or preparing aji initial ordering 16~ 

Preparing an initial ordering of the corpus (step as shown in FIG. 2. 
50); Fractionation can be thought of as a more careful 

Determining a partitioning of the desired size from version of Buckshot that trades speed for increased 
the initial ordering (step 66); and accuracy. In particular, steps 1 and 2 of Buckshot are 

Improving the partitioning by refinement (step 55 replaced by a deterministic center finding procedure, 
68). and step 4 is elaborated with additional refinement itera- 

Linear time is sustained because quadratic-time oper- tions and procedures. Since the refinement procedures 
ations are restricted to subcorpora whose size does not have the capability to merge, split and destroy clusters, 
grow with the size of the corpus. Fractionation may not produce exactly k document 

Simple agglomeration applied to a large corpus is 60 groups. In other words, Fractionation is an adaptive 
quadratic in the number of individuals. If it were possi- clustering method. 

ble to produce a reasonable initial partitioning of some The center finding procedure finds k centers by ini- 
given size in time proportional to the number of individ- tially breaking C into N/m buckets of a fixed size m. 
uals, the application of any fixed number of refinement The individuals in these buckets are then separately 
processes would not increase the order of the overall 65 agglomerated into document groups such that the re- 
procedure. Further, if not initially then through refine- duction in number (from individuals to groups in each 
ment, the final result can be a good (internally coherent) bucket) is roughly p. These groups are now treated as if 
partitioning of the corpus. they were individuals, and the entire process is re- 
peated. The iteration terminates when only k groups remain. The center finding procedure can be viewed as building a 1/ρ branching tree bottom up, where the leaves are individual documents, terminating when only k roots remain.

Suppose the individuals in C are enumerated, so that $C = \{\alpha_1, \alpha_2, \ldots, \alpha_N\}$. This ordering could reflect an extrinsic ordering on C, but a better procedure sorts C based on a key which is the word index of the Kth most common word in each individual, where $1 \le K < |V|$. Typically K is a smaller number, such as three, which favors medium frequency terms. This procedure thus encourages that nearby individuals in the corpus ordering have at least one word in common.

Group selection that provides relatively weak similarity of documents grouped together or placed in adjacent groups, will enhance the quality of the step-by-step bucketing of these groups. Thus, it is worthwhile to do some structuring (initial ordering) of the corpus into groups of adjacent elements of size n. One example of such an initial ordering process (box 50, FIG. 4) is based upon word similarity, and is described as follows:

Initial ordering

Input C, a corpus

1. Sort words (stem types) by entries in corpus countfile, most frequent first (Step 52).
1.1 Segment the sorted countfile (Step 54) according to frequency > c (segment c), c > frequency ≥ d (segment d), and frequency < d (segment e).
1.2 Rearrange countfile (Step 56) in order segment d, then segment c, then segment e.
1.3 Renumber words (Step 58) to reflect this reordering.
2. Label (Step 60) each document by the number of the earliest word in the sorted file which appears in it.
3. Adjoin (Step 62) to this number the number of the first word (in text order) in the document.
4. Sort documents (Step 64) by the compound label (earliest in countfile, earliest in document).

The various steps take times of the following orders: Step 1 takes |V|log|V|; step 2 takes |C| time (multiplied by the average length of a document's profile); step 3 takes |C| time; step 4 takes |C|log|C| time. Step 1.1 takes |V| time, all of 1.1 to 1.2 are bounded by the total vocabulary size, and grow only slowly and boundedly as |C| grows.

The center finding procedure builds an initial partitioning of a corpus C by first applying the structuring process to construct a crude grouping into groups of size n and then applying agglomerative clustering as previously described.

The initial bucketing creates a partition $B = \{\Theta_1, \Theta_2, \ldots, \Theta_{N/m}\}$:

$$\Theta_i = \{\alpha_{m(i-1)+1}, \alpha_{m(i-1)+2}, \ldots, \alpha_{mi}\}$$

The document groups $\Theta_i$ are then separately clustered using an agglomerative procedure, such as truncated group average agglomerative clustering, with k=ρm, where ρ is the desired reduction factor. Note that each of these determinations occurs in $m^2$ time and hence, all N/m occur in Nm time. Each application of agglomerative clustering produces an associated partition $R_i = \{\Phi_{i,1}, \Phi_{i,2}, \ldots, \Phi_{i,\rho m}\}$. The union of the document groups contained in these partitions are then treated as individuals for the next iteration. That is, define

$$C' = \{\Phi_{i,j} : 1 \le i \le N/m,\ 1 \le j \le \rho m\}.$$

C' inherits an enumeration order by defining

$$\alpha'_{\rho m(i-1)+j} = \Phi_{i,j}$$

The process is then repeated with C' instead of C. That is, the ρN components of C' are broken into ρN/m buckets, which are further reduced to $\rho^2 N$ groups by agglomeration. The process terminates at iteration j if $\rho^j N < k$. At this point one final application of agglomerative clustering can reduce the remaining groups to a partition P of size k (Step 66), for instance with m=200 and ρ=1/4.

The Fractionation refinement procedure (Step 68) performs a fixed number of iterations of a cycle of three separate operators. The first operator, Split, generates a new partition by dividing each document group in an existing partition into two parts. The second operator, Join, merges document groups that are indistinguishable from the point of view of Cluster Digest. The third operator is some number of iterations of Assign-to-Nearest, followed by elimination of groups smaller than some threshold.

For each document group Γ in a partition P, the Split operator divides Γ into two new groups. This can be accomplished by applying Buckshot clustering with C=Γ and k=2. The resulting Buckshot partition G provides the two new groups. Let $P = \{\Gamma_1, \ldots, \Gamma_k\}$ and let $G^i = \{\Gamma^i_1, \Gamma^i_2\}$ be a two element Buckshot partition of $\Gamma_i$. The new partition P' is simply the union of the $G^i$'s:

$$P' = \bigcup_{i=1}^{k} G^i$$

Note that |P'|=2k. Each application of Buckshot requires time proportional to $2|\Gamma_i|$. Hence, the overall determination can be performed in time proportional to 2N.

A modification of this procedure would only split groups that score poorly on some coherency criterion. One simple coherency criterion is simply the average similarity to the cluster centroid:

$$A(\Gamma) = \frac{1}{|\Gamma|} \sum_{\alpha \in \Gamma} s(\alpha, \Gamma)$$

Let $r(\Gamma_i, P)$ be the rank of $A(\Gamma_i)$ in the set $\{A(\Gamma_1), A(\Gamma_2), \ldots, A(\Gamma_k)\}$. This procedure would then only split groups such that $r(\Gamma_i, P) < pk$ for some p, 0<p<1. This modification does not change the order of the procedure since the coherence criterion can be determined in time proportional to N.

The purpose of the Join refinement operator is to merge document groups in a partition P that would be difficult to distinguish if they were summarized by Cluster Digest. Since by definition any two elements of P are disjoint, they will never have "typical" documents in common. However, their lists of "topical" words may well overlap. Therefore the criterion of distinguishability between two groups Γ and Δ will be:

$$T(\Gamma, \Delta) = |t(\Gamma) \cap t(\Delta)|$$
where t(Γ) is the list of topical words for Γ generated by the Cluster Digest summary.

Let Γ be related to Δ if $T(\Gamma, \Delta) > p$, for some p, $0 < p \le w$. Let S be the transitive closure of this relation. That is, $S(\Gamma, \Delta)$ if there exists a sequence of groups $(\Delta_1, \Delta_2, \ldots, \Delta_m)$ such that $T(\Gamma, \Delta_1) > p$, $T(\Delta_i, \Delta_{i+1}) > p$ for $1 \le i < m$, and $T(\Delta_m, \Delta) > p$. Since S is an equivalence relation, a new partition P' is generated by S. This partition is returned as the result of Join.

The determination of the transitive closure requires time proportional to $k^2$; hence the time complexity of Join is $O(k^2)$.

In one illustrative implementation w=10 and p=6.

The inputs of refinement are an existing partition P, a parameter I which determines the number of Split, Join, Assign-to-Nearest cycles executed, a parameter M which determines the number of Assign-to-Nearest iterations in each cycle, and a threshold T which is the minimum allowable document group size. A typical sequence of refinement steps follows:

Do I times
1 Split P to form P'
2 Join P' to form P''
3 Do M times
3.1 Apply Assign-to-Nearest to P'' to form G
3.2 Eliminate Γ in G if |Γ|<T to form G'
3.3 Let P''=G'
4 Return P

The Elimination operator of Step 3.2 is implemented by supplying Assign-to-Nearest with C=Γ and merging the resulting partition into G.

Steps 1 and 3.1 require time proportional to N, Step 2 requires time proportional to $k^2$, and step 3.2 requires time proportional to the number of individuals assigned to nearest, which is always less than N. Therefore, assuming $k^2 < N$, which is typically the case, the overall time complexity of refinement is O(N).

In the illustrative implementation I=3, M=3 and T=10.

Fractionation takes as input a corpus C, and a parameter k, the desired number of clusters.

C. Summary

The Fractionation method as summarized includes the following steps:

1 Apply center finding to construct an initial partition P
2 Apply Assign-to-Nearest with G=P to form P'
3 Apply Refinement to form P''
4 Return P'' as the final partition of C

Note that, due to refinement, the size of the returned partition may not be equal to k.

In contrast to Buckshot, Fractionation is a deterministic procedure which returns the same partition if repeatedly called with the same values for C and k.

Step 1 is O(mN); step 2 O(kN); and step 3 O(N). Therefore assuming k<m, the overall time complexity of Fractionation is O(mN).

Scatter-Gather is an interactive browsing tool, and hence is better thought of as a collection of operators, rather than as a method that determines a particular quantity. It is presumed that a fixed corpus of interest, C, has been identified, and that the user is seated at a display (monitor 8) from which commands can be issued (via input device 9) to perform the next step. Since C is fixed, it is presumed that an initial partition can be determined off-line by application of the Fractionation clustering method. Fractionation is employed rather than Buckshot, since time is less of an issue for an off-line determination; hence time can be traded for accuracy. However, as noted in the earlier example, the initial ordering could be performed by Buckshot.

At start, this initial partition is presented to the user (via monitor 8 or printer 10) by applying Cluster Digest to each of the k document groups in order to determine a partition of a desired size as shown in step 17 of FIG. 2. Since Cluster Digest determines only two quantities of interest, a short list of typical document titles, and a short list of topical words, the presentation of each document group need not occupy more than a few lines of the display. In particular, it should be possible to present the entire partition in such a way that scrolling is not necessary to view all the summaries.

The operators available to the user are Pick, Recluster, Display and Backup. Given an existing partition, the Pick operation allows the user to select one or more document groups to form a new, reduced, subcorpus. The Recluster operator applies Buckshot to the current reduced subcorpus, defined by the last application of Pick. The Display operator presents the current partition by repeated application of Cluster Digest. The Backup operator replaces the current subcorpus with the subcorpus defined by the second most recent application of Pick.

A typical session iteratively applies a cycle of Pick, Recluster, and Display until the user comes across document groups of interest small enough for exhaustive display, or switches to the use of a directed search tool based on a term uncovered in the browsing process. A false Pick step can be undone by application of Backup.

While the present invention has been described with reference to particular preferred embodiments, the invention is not limited to the specific examples given and other embodiments and modifications can be made by those skilled in the art without departing from the spirit and scope of the invention.

What is claimed is:

1. A document browsing method in a digital computer for a corpus of documents, comprising the steps of:

preparing an initial ordering of the corpus into a first plurality of clusters by using a first method that automatically performs the initial ordering without external inputs based on contents of the documents using the digital computer;

determining a summary for each cluster of the first plurality of clusters prepared by said initial ordering of the corpus;

selecting by a user at least one cluster of the first plurality of clusters based on the summary of each cluster; and

automatically providing a further ordering of the user selected at least one cluster into a second plurality of clusters by automatically analyzing contents of documents of the selected at least one cluster using a second method comprising the steps of:

grouping together all of the documents from the selected at least one cluster based on the content of each document, and then

assigning each of the documents to one cluster of the second plurality of clusters.

2. The method of claim 1, wherein the preparing step includes a Fractionation method for partitioning the corpus of documents, said Fractionation method comprising the steps of:
preparing an ordering of the corpus;

determining a partitioning of a desired size from the ordering; and

refining the partitioning.

3. The method of claim 2, wherein the preparing an
ordering step includes:

sorting words in order of frequency, most frequent
word first, by entry into a corpus countfile;

labeling each document by a number of an earliest
word in a sorted corpus countfile;

adjoining the number of an earliest word in a sorted
corpus countfile to a number of a first text-ordered
word in the document to form a compound label;
and

sorting documents by the compound label.
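
The ordering steps recited in claim 3 can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `initial_ordering`, the representation of a document as a list of word tokens, and the alphabetical tie-break between equally frequent words are assumptions, and the countfile segmenting and rearranging of claim 4 is omitted.

```python
from collections import Counter

def initial_ordering(docs):
    """Return document indices sorted by the compound label
    (earliest countfile word in the document, first word in text order)."""
    # Corpus countfile: words numbered by total frequency, most frequent first.
    counts = Counter(word for doc in docs for word in doc)
    ranked = sorted(counts, key=lambda w: (-counts[w], w))  # tie-break is an assumption
    number = {word: i for i, word in enumerate(ranked)}

    def label(doc):
        if not doc:
            return (len(number), len(number))   # empty documents sort last
        earliest = min(number[w] for w in doc)  # earliest countfile word appearing in doc
        first = number[doc[0]]                  # first word in text order
        return (earliest, first)

    return sorted(range(len(docs)), key=lambda i: label(docs[i]))
```

Documents that share their most frequent word end up adjacent in the returned order, which is the property the later fixed-size bucketing relies on.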

4. The method of claim 3, wherein the sorting words
step further comprises:

segmenting the sorted corpus countfile according to
frequency into a number of segments;

rearranging the sorted corpus countfile according to
segments; and

renumbering words to reflect the rearranging.

5. The method of claim 2, wherein the determining a
partition step comprises truncated group averaging
agglomerative clustering which includes limiting a
growth of an agglomeration by terminating a group
averaging agglomerative clustering before a single
over-arching agglomeration is formed.

6. The method of claim 2, wherein the refining step
includes refining with an assign-to-nearest method for
assigning a document to a nearest bucket.

7. The method of claim 2, wherein the refining step
includes merging similar buckets.
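
One way to picture the merging of similar buckets is the Join operator of the description, which relates two groups when their Cluster Digest topical-word lists share more than p words and then merges along the transitive closure of that relation. The sketch below is a hedged illustration using a union-find structure; the function name `join` and the dictionary layout of `topical` are assumptions, not taken from the specification.

```python
def join(topical, p):
    """Merge groups whose topical-word lists overlap in more than p words,
    following the transitive closure of that relation.

    `topical` maps a group id to its list of topical words (assumed layout).
    Returns the merged groups as sorted lists of group ids.
    """
    ids = sorted(topical)
    parent = {g: g for g in ids}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path halving
            g = parent[g]
        return g

    # All pairs are compared, so the cost is quadratic in the number of groups.
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if len(set(topical[a]) & set(topical[b])) > p:
                parent[find(a)] = find(b)

    merged = {}
    for g in ids:
        merged.setdefault(find(g), []).append(g)
    return sorted(merged.values())
```

Groups 1 and 3 in a chain 1-2-3 are merged even when they do not overlap directly, which is what taking the transitive closure means here.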

8. The method of claim 2, wherein the refining step
includes splitting non-similar buckets.

9. The method of claim 2, wherein the refining step 
includes detecting at least one of weak similarity and 
small buckets and incoherent buckets by applying size 
and average similarity thresholds. 

10. The document partitioning method of claim 1,
wherein the determining a summary step includes deter-
mining a summary using a Cluster Digest method, said
Cluster Digest method comprising the steps of:

providing a summary of constant size for each clus-
ter; and

listing a fixed number of topical words plus document
titles of a few typical documents within each clus-
ter, wherein the topical words are words that
occur often in the cluster and typical documents
are documents close to a cluster centroid.
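
The summary recited in claim 10 can be illustrated with a short Python sketch. The claim fixes only that topical words occur often in the cluster and that typical documents lie close to the centroid; the use of raw word counts as profiles, cosine similarity as the closeness measure, and the names `cluster_digest`, `num_words` and `num_docs` are assumptions made for the example.

```python
import math
from collections import Counter

def cluster_digest(cluster, num_words=3, num_docs=2):
    """Summarize a cluster given as a list of (title, tokens) pairs."""
    profiles = [Counter(tokens) for _, tokens in cluster]
    centroid = Counter()
    for profile in profiles:
        centroid.update(profile)

    # Topical words: the words that occur most often in the cluster.
    by_count = sorted(centroid.items(), key=lambda kv: (-kv[1], kv[0]))
    topical = [word for word, _ in by_count[:num_words]]

    def cosine(a, b):
        dot = sum(v * b[w] for w, v in a.items() if w in b)
        norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    # Typical documents: those closest to the centroid (cosine is an assumption).
    ranked = sorted(range(len(cluster)), key=lambda i: -cosine(profiles[i], centroid))
    typical = [cluster[i][0] for i in ranked[:num_docs]]
    return topical, typical
```

Because both lists have a fixed length, the summary stays the same size however large the cluster is.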

11. The document partitioning method of claim 10,
wherein the providing a further ordering step includes
providing a further ordering using a Buckshot method,
said Buckshot method comprising the steps of:

constructing a random sample from the corpus of
documents of size $\sqrt{kN}$ where k is an integer num-
ber of desired clusters and N is a number of docu-
ments in the corpus of documents;

partitioning into a partition G a random sample into k
groups using truncated group average agglomera-
tive clustering;

constructing a partition P of the corpus of documents
by assigning each document to a k center in parti-
tion G and applying an assign-to-nearest procedure
over the corpus and the k centers in partition G;

replacing partition G with partition P and repeating
the step of constructing a partition; and

returning a new partition P.




12. The document partitioning method of claim 1, 
wherein the providing a further ordering step includes 
providing a further ordering step using a Buckshot 
method, said Buckshot method comprising the steps of: 

constructing a random sample from the corpus of
documents of size $\sqrt{kN}$ where k is an integer
number of desired clusters and N is a number of
documents in the corpus of documents;

partitioning into a partition G a random sample into k 
groups using truncated group average agglomera- 
tive clustering; 

constructing a partition P of the corpus of documents 
by assigning each document to a k center in parti- 
tion G and applying assign-to-nearest over the 
corpus and the k centers in partition G; 

replacing partition G with partition P and repeating 
the step of constructing a partition; and 

returning a new partition P. 
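
The Buckshot steps recited above can be sketched as follows. This is a simplified illustration, not the patented implementation: documents are assumed to be token lists, cosine similarity between word-count profiles stands in for the similarity measure, merging the two groups with the most similar centroids stands in for truncated group average agglomerative clustering, and the names `buckshot`, `rounds` and `seed` are hypothetical.

```python
import math
import random
from collections import Counter

def cosine(a, b):
    dot = sum(v * b[w] for w, v in a.items() if w in b)
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def centroid(group, profiles):
    total = Counter()
    for i in group:
        total.update(profiles[i])
    return total

def agglomerate(items, profiles, k):
    """Merge the two most similar groups until only k remain (simplified)."""
    groups = [[i] for i in items]
    while len(groups) > k:
        cents = [centroid(g, profiles) for g in groups]
        pairs = [(cosine(cents[a], cents[b]), a, b)
                 for a in range(len(groups)) for b in range(a + 1, len(groups))]
        _, a, b = max(pairs)
        absorbed = groups.pop(b)          # b > a, so index a is unaffected
        groups[a].extend(absorbed)
    return groups

def assign_to_nearest(profiles, centers):
    groups = [[] for _ in centers]
    for i, profile in enumerate(profiles):
        best = max(range(len(centers)), key=lambda c: cosine(profile, centers[c]))
        groups[best].append(i)
    return groups

def buckshot(docs, k, rounds=1, seed=0):
    profiles = [Counter(doc) for doc in docs]
    n = len(profiles)
    size = min(n, max(k, math.isqrt(k * n)))         # sample of about sqrt(kN) documents
    sample = random.Random(seed).sample(range(n), size)
    groups = agglomerate(sample, profiles, k)         # partition G of the sample
    for _ in range(rounds + 1):                       # assign, then repeat with P as G
        centers = [centroid(g, profiles) for g in groups if g]
        groups = assign_to_nearest(profiles, centers)
    return [g for g in groups if g]
```

Only the sample is clustered quadratically; every other document is touched once per assignment pass, which is where the speed of Buckshot comes from. Because the sample is random, different seeds can give different partitions.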

13. The document browsing method of claim 1, 
wherein the first method for preparing an initial order- 
ing of the corpus is the same as the second method for 
providing a further ordering of a portion of the corpus. 

14. A document browsing system for use with a cor- 
pus of documents in a digital computer, the document 
browsing system comprising: 

preparing means for preparing without external in- 
puts an initial ordering of the corpus into a first 
plurality of document clusters using the digital 
computer; 

determining means for determining a summary for 
each cluster of the first plurality of document clus- 
ters prepared by said preparing means; 

selecting means for a user to select at least one of the 
first plurality of document clusters; and 

ordering means for automatically ordering the se- 
lected at least one of the first plurality of document 
clusters into a second plurality of clusters by 
analyzing contents of documents of the selected at 
least one of the first plurality of document clus- 
ters, 

grouping together all of the documents from the 
selected at least one of the first plurality of docu- 
ment clusters based on the contents of the docu- 
ments of the selected at least one of the first 
plurality of document clusters, and then 
assigning each of the documents to one cluster of 
the second plurality of clusters. 
15. A document partitioning Fractionation method in
a digital computer for non-hierarchical, linear-time par-
titioning of a corpus of documents, said Fractionation
method comprising the steps of:
preparing an ordering of the corpus by
sorting words in order of frequency, most frequent
word first, by entry into a corpus countfile,
labeling each document by a number of an earliest
word in a sorted corpus countfile,
adjoining the number of an earliest word in a sorted
corpus countfile to a number of a first text-
ordered word in the document to form a com-
pound label, and
sorting documents by the compound label;
determining a partitioning of a desired size from the
ordering to form a set of buckets, each document of
the corpus of documents assigned to only one
bucket of the set of buckets; and
refining the partitioning by a predetermined number
of iterations of
creating a set of modified buckets from the set
of buckets based on contents and size of each
bucket, and
reassigning each document of the corpus of docu-
ments to the set of modified buckets.

16. The Fractionation method of claim 14, wherein 
the sorting words step further comprises: 

segmenting the sorted corpus countfile according to 
frequency into a number of segments; 

rearranging the sorted corpus countfile according to 
segments; and 

renumbering words to reflect the rearranging. 

17. The Fractionation method of claim 14, wherein
the determining step comprises truncated group averag-
ing agglomerative clustering which includes limiting a
growth of an agglomeration by terminating a group
averaging agglomerative clustering before a single
over-arching agglomeration is formed.

18. The Fractionation method of claim 14, wherein
the refining step includes refining with an assign-to-near-
est method for assigning a document to a nearest
bucket.

19. The Fractionation method of claim 14, wherein
the refining step includes merging similar buckets.

20. The Fractionation method of claim 14, wherein
the refining step includes splitting non-similar buckets.

21. The method of claim 14, wherein the refining step
includes detecting at least one of weak similarity and
small buckets and incoherent buckets by applying size
and average similarity thresholds.

* * * * *
AIM.BAS
AIMPASS2.BAS
KYINVERT.BAS
FREQCOMP.BAS
RELATIVE.BAS
REL-INV.BAS
POLYSEMY.BAS
COMBOOST.BAS
NEWKEY.BAS
CHANGEFL.BAS

FIG. 8
AIMPASS2.BAS

' Invoked: aimpass2 ConfigFile
' Creates: Key.Ndx, Weight.Ndx, \Aim\Commands.Dat
' Uses: Dict.Wrd, Idf.Sin, Docindex.Ah, DocKeys.Ah
' nth Rec of Key.Ndx contains NumKeysInDoc n followed by 63 codes
' TYPE KeyNdx
'   Num AS INTEGER
'   Code (1 TO 63) AS INTEGER
' nth Rec of Weight.Ndx contains 63 Salton Weights computed with the following
' Salton weight formula
' TYPE WeightNdx
'   Weight (1 TO 63) AS SINGLE
' SALTON WEIGHT FORMULA

Log2(FreqinO()cM)*Log2((TotDocs*2)/DocsWiMonlt2*2*lbtOocs)
Weighted)- --- Tf ~™~ 7r _-

FIG. 10B



KYINVERT.BAS

' Invoked: kyinvert ConfigFile
' Creates: KyInvert.Ndx, KyInvert.Dat
' Uses: Key.Ndx, Weight.Ndx & Dict.Wrd for NumKeys
' nth Rec of KyInvert.Ndx contains nth Code, ptr into KyInvert.Dat &
' NumDocsWithWord
' TYPE NdxType
'   Code AS INTEGER
'   Index AS LONG
'   Num AS INTEGER
' pointed to record of KyInvert.Dat contains first DocWithWord & 1000*WeightInDoc
' TYPE RecValue
'   Rec AS INTEGER
'   Value AS INTEGER

FIG. 10C
FREQCOMP.BAS

Implements the Inverted Index access method along with the weighted values
to calculate the frequent companions for each of the words used in the
document collection.
Invoked: freqcomp ConfigFile
Creates: FreqComp.127
Uses: KyInvert.Ndx, KyInvert.Dat, Key.Ndx, Weight.Ndx
nth Rec of FreqComp.127 contains nth code, Num of companions of this code &
127 pairs of (code, 100*weight)
TYPE FreqComp127
  Num AS INTEGER
  Code AS INTEGER
  Comp (1 TO 127) AS INTEGER
  Value (1 TO 127) AS STRING * 1

FIG. 10D



RELATIVE.BAS

' are there any FreqComps for this keyword?
' we found the word in j's FCList
' look for the word itself in word j's FCList
' apply formula of (Lower * 6 + Higher) / 7
' sort in decreasing order, by value
' save first 63 as the relatives

Invoked: relative ConfigFile
Creates: Relative.63
Uses: FreqComp.127 & Dict.Wrd for NumKeys
nth Rec of Relative.63 contains nth Code, Num of relatives of this code &
127 pairs of (code, 100*weight)
TYPE FreqComp63
  Num AS INTEGER
  Code AS INTEGER
  Comp (1 TO 63) AS INTEGER
  Value (1 TO 63) AS STRING * 1

FIG. 10E
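
The comments in the RELATIVE.BAS listing suggest how a relative percentage is formed: the word must appear in its companion's own frequent-companion list, the two percentages are combined as (Lower * 6 + Higher) / 7, and the 63 highest values are kept. A hedged Python sketch of that reading follows; the nested-dictionary layout of `companions` stands in for the FreqComp.127 records and is an assumption, as is the function name `relatives`.

```python
def relatives(word, companions, limit=63):
    """Rank the relatives of `word`.

    `companions` maps each word to {companion: percentage}; this layout is an
    assumption standing in for the FreqComp.127 records.
    """
    scored = []
    for other, pct in companions.get(word, {}).items():
        back = companions.get(other, {}).get(word)  # is `word` in the companion's own list?
        if back is None:
            continue                                 # not mutual, so not kept as a relative
        lower, higher = sorted((pct, back))
        scored.append((other, (lower * 6 + higher) / 7))
    scored.sort(key=lambda item: (-item[1], item[0]))  # decreasing order by value
    return scored[:limit]
```

Weighting the lower percentage six times as heavily as the higher one means a pair only scores well when the association is strong in both directions.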
REL-INV.BAS

Invoked: rel-inv ConfigFile
Creates: Rel-Inv.Ndx, Rel-Inv.Dat
Uses: Relative.63 & Dict.Wrd for NumKeys
nth Rec of Rel-Inv.Ndx contains nth code, ptr into Rel-Inv.Dat & NumRelsOfCode
TYPE NdxType
  Code AS INTEGER
  Index AS LONG
  Num AS INTEGER
pointed to record of Rel-Inv.Dat contains code of the first FreqComp list
of nth code & the nth code's value in that list
TYPE RecValue
  Rec AS INTEGER
  Value AS STRING * 1

FIG. 10F



POLYSEMY.BAS

' Invoked: polysemy ConfigFile
' Creates: PolySemy.Dat & PolyAvg.Dat, PolySemy.Lst
' Uses: Relative.63, Rel-Inv.Ndx, Dict.Wrd
' nth Rec of PolySemy.Dat contains Poly Value of nth word calculated as follows:
' POLY FORMULA:

, ' PolyM Avg3! */Avg3!/Avg20!)*Avg6!*(AvQ6!/AvgD63!)r.5*lRelfreq/DocFreq) A 4
' TYPE PolyType
'   Code AS INTEGER
'   Value AS SINGLE
'   Pas AS STRING * 2

FIG. 10G
COMBOOST.BAS

boosts poly of combined keywords by multiplying them by 1.4
Invoked: comboost ConfigFile
Creates:
Uses: PolySemy.Dat, DictSort.Cod

FIG. 10H



NEWKEY.BAS

Invoked: newkey ConfigFile
Creates: NewKey.Ndx, NewVal.Ndx
Uses: Key.Ndx, Weight.Ndx, PolySemy.Dat, KyInvrts.Ndx
NewVal.Ndx contains new weights for Doc n sorted by new weight where
NewWeight»(Weight * Poly) J25
TYPE WeightAvgNdx
  Weight (1 TO 63) AS SINGLE
  Mult AS SINGLE
Mult is used to vary the number of sentences in the abstract program
NewKey.Ndx contains the corresponding codes for Doc n, i.e. sorted by
their new weights
TYPE KeyNdx
  Num AS INTEGER
  Code (1 TO 63) AS INTEGER

FIG. 10I



CHANGEFL.BAS

Invoked: changefl ConfigFile
Creates: KyInvrts.Ndx, Rel-Invs.Ndx
TYPE SmallNdxType
  Index AS LONG
  Num AS INTEGER
Uses: KyInvert.Ndx, Rel-Inv.Ndx
TYPE NdxType
  Code AS INTEGER
  Index AS LONG
  Num AS INTEGER
makes copies of KyInvert.Ndx & Rel-Inv.Ndx without the code field

FIG. 10J
Main: Put up Menu For User
Config: Initializes Data & Variables
LoadData: Open Files & Load Some Files into Memory
AddSentence: Get User Query & Convert Terms Found to Codes
ReadText: Replace Punctuation etc. in Query
WordParse: Build Table of Query Words
FindCombKey: Match Query Table Against Combined Term Vocabulary
FirstLast: Use Index to Find Range to Check in Vocabulary
FindSingKey: Match Query Table Words Against Single Word Vocabulary & Strip Prefixes to Find Stem
ShowExpn: Display Matched Query Terms
ShowQuery: Display Query
SelectMenu: Get User Menu Choice
OtherWords: Find & Display Related Terms & Get User Choices
FindRelatives: Retrieves Relatives for Given Word
BuildCombTable: Build CombTable of Relative Percentages Modified by Relative Document Freq's For All Pairs of Query Terms
AddSwaps: Build Sorted (Ranked) Table of Swap Terms
ComboSum: Calculate Multiplier For Swap & Document Values
SelectRelatives: Get User Choices of Swap Terms
DeleteWord: Delete Term From Query Table
InsertWord: Add the Last Deleted Term Back Into Query
AddSearchTerms: Display All Terms Beginning With A User Entered String of Letters & Get User Choices
RankRecords: Build Sorted (Ranked) Tables of Documents
ShowDoc: Get User Choice After Ranking Documents
ShowAbstr: Get Abstract Text From File
PrintAbstr: Display Abstract Text
ShowKeywords: Display Highest Weight Terms in the Document Chosen
ShowHist: Display Histogram Showing Relative Document Values
ScrolHist: Scroll Histogram Horizontally
ReWriteHist: Scroll Histogram Horizontally

FIG. 11
Abstract specs

Sentence ENDS ALGORITHM:

_ = space
U = upper case letter
l = lower case letter
X = any character
N = upper case noise word
# = number
E = . or ? or !

1) E_IJU ENDS 

2) E_[JI NOTENDS 

3) EJf ENDS 

4) JO NOTENDS 

5) AE NOTENDS 

where A = zB, ggf, inkl, vgl, grunds, ausschl, einschl, Kl,
Bekl, Nr, Ger, BerGer, ff, subj, obj

6) LEN<8 NOTENDS

Sentence RANKING ALGORITHM:
WORD WEIGHT Formula:

__JbstractWoraleightyalije«DocumentWeight-*-(PolyValue-.T25) 

SENTENCE VALUE Formula: 

SentVal! = sum over keywords in sentence of AbstractWordWeightValue

FinalSentValue! = SentVal! / SQR(NumWordsInSent)



FIG. 12 A 
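
The sentence value formula of FIG. 12A can be restated as a small Python function. This is a sketch under stated assumptions: the per-keyword AbstractWordWeightValue figures are taken as given in a dictionary (the word weight formula itself is not reproduced here), words missing from that dictionary are treated as non-keywords contributing nothing, and the function name is hypothetical.

```python
import math

def final_sentence_value(sentence_words, word_weights):
    """SentVal / SQR(NumWordsInSent), with SentVal summed over the keywords
    that occur in the sentence."""
    if not sentence_words:
        return 0.0
    sent_val = sum(word_weights.get(word, 0.0) for word in sentence_words)
    return sent_val / math.sqrt(len(sentence_words))
```

Dividing by the square root of the sentence length, rather than by the length itself, damps the advantage of long sentences without penalizing them as strongly as a plain average would.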
METHOD OF INDEXING AND RETRIEVAL OF ELECTRONICALLY-STORED DOCUMENTS

This is a continuation of application Ser. No. 07/998,023, filed Dec. 29, 1992; which is a continuation-in-part of U.S. application Ser. No. 07/456,558, filed Dec. 26, 1989, both now abandoned.

BACKGROUND OF THE INVENTION

1. Field of the Invention

This invention relates generally to document storage and retrieval systems and more particularly to a method of indexing documents so that they can be retrieved in response to a query in order of their relevance to the query. It also permits a general query to be easily modified based on the content of the documents so that the new query will retrieve documents that are relevant to the original query.

2. Description of the Prior Art

Document retrieval based on indexing of the documents in a document data base is well known. Typically the documents are indexed by creating an index file which records the documents that each word is in. Then when the user inputs a query, the documents that contain one or more words of the query can be quickly identified. However, if the query consists of general words that are not terms of art, the query may produce unsatisfactory retrieval results by either producing few documents that are of interest to the user or producing many documents that are not interesting to the user or both.

SUMMARY OF THE INVENTION

A principal object of the present invention is to provide an improved method of indexing and retrieving documents which:

(A) allows a user to easily modify his query based on the content of the documents so that the new query will retrieve documents that are of interest to the user;

(B) accurately ranks the documents in order of relevance to the query; and

(C) allows the user to peruse the documents extremely quickly.

Another object of the present invention is to use the Soft Boolean Connector concept to adjust the number of hits (i.e., the number of query words that a document is credited with for ranking purposes) by giving less than a full hit to a word that often co-occurs with other query words. Another object of the present invention is to use the Soft Boolean Connector concept to adjust the number of hits (i.e., the number of query words that a word is credited with by virtue of its being related to those query words) for a possible suggested word by giving less than a full hit to a word that often co-occurs with the other query words.

These objects, as well as other objects which will become apparent from the discussion that follows, are achieved according to the present invention by the following steps (note: in the following the words "term" and "keyword" stand for both a single word and a phrase consisting of a group of words, e.g., "patent application".):

1. Indexing the documents by creating index files of which documents contain each term, how many times the term appears in the document, and how many documents each term appears in.

2. Assigning as many weights to each term as there are documents that contain that term, where the weight of a term in a document depends on the number of times the term appears in the document, the number of documents that the term appears in, and the total number of terms in the document.

3. Constructing for each term a list of companions of said term, which list contains the terms (companions) that appear in the same documents as said term in order of the sum of the weights of the companions over all documents that contain both the term and the companion. Associated with each companion is the companion percentage which is the sum used to rank the companions.

4. Using the companion lists to construct relative lists for each term, which relative lists usually contain only those companions which also have said term as a companion. Associated with each relative is the relative percentage which is a weighted average of the companion's percentage as a companion of the term and the term's companion percentage as a companion of the companion. The relative percentages are used to rank the relatives.

5. Assigning a "polysemantic" weight to each term, which polysemantic weight depends on the number of documents that the term is in, the number of relatives that the term has, and the relative strength of the first few relatives to the other relatives.

6. Presenting to the user, in response to a query, a list of "SWAPS" (Synthetic Word Association Pattern Search) terms that are the best relatives to the entire group of terms contained in the query and allowing the user to add one or more of the presented terms to the query.

7. Ranking the documents according to how many query terms are contained in the document, their polysemantic weights and their weights in the documents.

The present invention facilitates the rapid searching of a document data base for documents that are of interest to the user. By using the suggested SWAPS terms the user can modify his query so as to retrieve those documents, if they exist in the data base, which are of interest. Since the SWAPS terms that are presented are in many of the documents that the original query terms are in, adding them to the query is guaranteed to retrieve those documents and others containing the SWAPS terms. By using the SWAPS feature repeatedly the user can in effect roam around the data base without actually retrieving and reading documents. Only after the query has been modified to include all the interesting SWAPS terms does the user need to actually retrieve the documents. The user can start with a poor query and modify it using SWAPS so that it becomes a good query. The user need not waste time formulating a good query that will not retrieve any relevant documents because there happen to be no such documents in the data base. The SWAPS terms that are suggested will always retrieve documents that contain them, i.e., documents that are likely to be relevant.

The ranking of the documents also facilitates rapid searching because the user can be confident that the highest ranked documents will be the documents that are most relevant to the query and that all documents which have any relevance will be retrieved and ranked.

The foregoing and other objects, features and advantages of the present invention will become apparent from the following, more particular description of the
preferred embodiments of the invention, as illustrated in the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a computer system embodying the present invention;

FIG. 2 is a view of the display screen showing an entered query and the result of parsing it;

FIG. 3 is a view of the display screen showing suggested SWAPS terms for the query of FIG. 2;

FIG. 4 is a view of the display screen showing the modified query;

FIG. 5 is a view of the display screen showing suggested SWAPS terms for the modified query of FIG. 4;

FIG. 6 is a view of the display screen showing a second modification of the query based on choosing SWAPS terms from FIG. 5;

FIG. 7 is a view of the display screen as a result of ranking the documents for the query of FIG. 6;

FIG. 8 is an operational flow diagram for indexing a set of documents;

FIG. 9 is a procedure tree for the QSEARCH program used for searching an indexed set of documents using the SWAPS and RANKING features;

FIGS. 10A to 10J are description of the program modules in FIG. 8;

FIG. 11 is a description of the program modules in FIG. 9; and

FIGS. 12A to 12C are description of the ABSTRACT program module.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

This invention will now be described as embodied in a computer system of the type shown in FIG. 1. This embodiment utilizes the following computer hardware and software:

(1) IBM compatible personal computer with at least 4 MB of RAM, a large capacity hard drive, a display screen, and a keyboard.

(2) MS-DOS compatible operating system and LIM 3.2 compatible expanded memory manager.

(3) A vocabulary file of terms (words and phrases)

(4) A series of programs that index the documents by constructing various files that hold information about which terms are in which documents, which documents contain which terms, the weights of the terms, and which terms are relatives of other terms by virtue of occurring in the same documents and how strongly they are related.

(5) A user program that accepts a query, suggests modifications to the query, and ranks the documents based on the modified query using the weights and relative strengths of the terms of the query.

The Vocabulary file is structured as a list of headwords each with a short synonym list. All of the synonyms of a given headword are assigned the same code.

The full list of indexing programs can be found in FIG. 10. Here we will describe the most important of these programs: AIM, AIMPASS2, FREQCOMP, RELATIVE, and POLYSEMY.

The first indexing program, AIM.BAS, creates the files DocKeys and IDF. DocKeys holds all of the Keywords and Keyword-Counts for all documents. IDF holds the document frequency, i.e., the number of documents a keyword appears in.

As the words in the documents are checked against the vocabulary to see if they are keywords, the case (upper or lower) is possibly changed and they are stripped of prefixes to see if the different case or stem is a keyword according to the following algorithms:

(UC=upper-case and LC=lower-case)

IF UC word is at the beginning of a sentence AND we don't have it in our vocabulary as a LC word THEN look for it in the Vocabulary as an UC word

IF UC word in middle of sentence AND we don't have it UC THEN look for it if it doesn't have a typical proper name ending

In USER Program Only: IF word NOT found THEN find both the stem AND find the Good prefix

(In the following "find" means that the stem and/or prefix is said to be in the document if the prefix is of the right type and the stem has the indicated length and is a keyword.)

IF GOOD prefix THEN
Find GOOD prefix if stem > 3 characters long
IF word is found THEN find if stem > 8 characters long
IF word is NOT found THEN find if stem > 5 characters long

IF POOR prefix THEN
If word is found THEN DONT find stem
If word is NOT found THEN find if stem > 5 characters long

List of Poor Prefixes:

hi, co, de, en, ex, im, in, un, re, con, eco, dis, epi, mal, mid, mis, non, off, out, pre, pan, sub, uni, demi, down, fore, hemi, high, meta, over, para, peri, post, self, semi, after, inter, quasi, trans, under

List of Good Prefixes:

air, bio, sea, sky, top, aero, and, auto, back, head, home, homo, hemo, mega, mini, mono, rear, poly, self, tele, viro, chemo, ferro, homeo, hyper, infra, intra, macro, micro, multi, hydro, radio, super, supra, ultra, contra, hetero, thermo, techno, nucleo, counter, electro, magneto.

The next indexing program is AIMPASS2.BAS. It creates Key and Weight files. The nth Rec of Key.Ndx contains NumKeysinDoc(n) followed by up to 127 Key codes which have Weight greater than or equal to the Adaptive Threshold Value. The Adaptive Threshold Value is the average Weight value of the 80th Keyword in each document (0 if there are less than 80 Keywords in a document). The nth Rec of Weight.Ndx contains up to 127 (or as many Keywords as are above the Adaptive Threshold Value) Document Weights computed with the following weight formula:



Weight(Word) = [ Log2(FreqInDoc + 1) × Log2( TotDocs ... / (DocsWithWord + 3 ...) ) ] / Log2( 2 + ... WordsInDoc ... )

(The ellipses mark portions of the formula that are illegible in this copy.)
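The Adaptive Threshold Value and the 127-key record limit described for AIMPASS2.BAS can be sketched in Python as follows. This is a hedged illustration, not the AIMPASS2 code: the function names and the dictionary layout are hypothetical, the per-keyword weights are taken as already computed, and "the 80th Keyword" is assumed to mean the keyword with the 80th highest weight in a document.

```python
def adaptive_threshold(doc_weights, rank=80):
    """Average, over all documents, of the weight of the rank-th keyword.

    doc_weights: dict mapping doc_id -> {keyword: weight} (hypothetical layout).
    A document with fewer than `rank` keywords contributes 0, as the text states.
    Assumption: keywords are ordered by descending weight within a document.
    """
    per_doc = []
    for weights in doc_weights.values():
        ordered = sorted(weights.values(), reverse=True)
        per_doc.append(ordered[rank - 1] if len(ordered) >= rank else 0.0)
    return sum(per_doc) / len(per_doc) if per_doc else 0.0

def keys_for_record(weights, threshold, max_keys=127):
    """Keywords of one document at or above the threshold, at most max_keys."""
    kept = [(k, w) for k, w in weights.items() if w >= threshold]
    kept.sort(key=lambda t: (-t[1], t[0]))  # highest weight first
    return kept[:max_keys]
```

Because short documents contribute zero to the average, a collection of mostly short documents ends up with a low threshold and keeps nearly all of its keywords.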
FREQCOMP.BAS implements the Inverted Index access method along with the weighted values to calculate the frequent companions for each of the words used in the document collection.

For each word ("A") in the controlled vocabulary dictionary, the WEIGHT (see above formula) values for each co-occurring word in the document (a co-occurring word to A is one that appears as a Keyword in the same document that A appears as a Keyword) are summed, along with the WEIGHT values for A in that document, respectively in all documents in which they co-occur. The sum values for each co-occurring word are converted to a percentage, scaled to the sum value for A (i.e., percentage = sum for word's WEIGHT values divided by the sum for A's WEIGHT values). Note that the percentages for the co-occurring words can be higher than 100% if they are heavily weighted in the same records in which A appears. The co-occurring words are then sorted in descending order (from highest percentage value to lowest) and the top 127 are written to a file (see below for structure). If there are 127 co-occurring words or fewer, then all of the co-occurring words will be written in descending sorted order.
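The frequent companion calculation described above can be sketched in Python. This is an illustration under assumptions, not the FREQCOMP.BAS code: the function name and the in-memory dictionary layout are hypothetical (the program itself works from inverted index files). Following the worked example in the text, the divisor for A is taken to be the sum of A's weights over all documents that contain A.

```python
from collections import defaultdict

def frequent_companions(doc_weights, top_n=127):
    """Return {word: [(companion, percentage), ...]} sorted by descending percentage.

    doc_weights: dict mapping doc_id -> {keyword: WEIGHT} (hypothetical layout).
    """
    total = defaultdict(float)                        # sum of a word's own weights
    co_sum = defaultdict(lambda: defaultdict(float))  # co_sum[a][f]: f's weights in docs shared with a
    for weights in doc_weights.values():
        for a, w_a in weights.items():
            total[a] += w_a
            for f, w_f in weights.items():
                if f != a:
                    co_sum[a][f] += w_f
    result = {}
    for a, companions in co_sum.items():
        if total[a] <= 0:
            continue  # guard against division by zero for zero-weight words
        ranked = sorted(((f, 100.0 * s / total[a]) for f, s in companions.items()),
                        key=lambda t: (-t[1], t[0]))
        result[a] = ranked[:top_n]
    return result
```

With illustrative weights chosen so that A sums to 5.5 over all its documents, B sums to 6.4, and A sums to 4.0 in the documents it shares with B, the sketch gives B about 116% in A's list and A about 63% in B's list. One of A's individual weights is unclear in this copy, so the per-document values used to check the sketch are assumptions.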



Definitions:

{} = co-occurring
weight = WEIGHT value

Example:

        Doc#1   Doc#2   Doc#3   Doc#4   Doc#5
A       0.3     1.5     2.0     0.7     0.8
B       0.8             1.9     1.2     2.5

5.5 = sum of weights of all A's
6.4 = sum of weights of B's co-occurring with A

Sample:

A 5.5 {} B 6.4 / 116%
B 6.4 {} A 4.0 / 63%

Resulting File:

Main Word     Co-Occurring Words . . . (sorted)
A             B 116% . . .
B             A 63% . . .
C             . . .
D             . . .
E             . . .

After the frequent companions have been found RELATIVE.BAS is run to define the relatives of each Keyword (A) according to the following algorithm:

Are there any FreqComps for A? If so, then for each FreqComp of A (F):
  look for F in A's FreqComp List and get its value
  look for the word itself (A) in word F's FreqComp List and get its value
  apply formula of (Lower × 6 + Higher)/7, where Lower is the lower of the two values obtained in the above two steps and Higher is the higher of the two values,
  sort the resulting list of words and values in decreasing order, by value
  save the first 63 (or as many as are found) of this list as the relatives for keyword A

For each word (called "A") in the dictionary which has Frequent Companions (not all do, because some words in the dictionary are not used at all in a database), take each Frequent Companion of A (called "F") and its Frequent Companion Percentage Value [FCPVal] in A's Frequent Companion List [FCList] (called "F-VAL") and look for the FCPVal of A in F's FCList (called "A-VAL"). NOTE: If A is not found in F's FCList, then A-VAL is zero (0). The RELATIVE value for F is calculated by multiplying the smaller of F-VAL and A-VAL by 6, adding the larger of F-VAL and A-VAL, and then dividing that sum by 7. If both A and F are in each other's FC lists, the resulting Relative value will be added to both words' Relative lists. If F is in A's FC List, but A is not in F's then F-VAL will be divided by seven and added only to A's Relative list.

After all the RELATIVE values are calculated for each Frequent Companion (F) in A's FCList, they are sorted in descending order and the top 63 of these words are written to A's Relative List. If there are fewer than 63 Relatives, then all of the Relatives will be written to A's Relative list, in descending order of RELATIVE value.

(based on above Frequency Companion File)

Relative Value = ((SmallerPercent Value × 6) + LargerPercent Value) / 7 (12 if not mutual)

Here the SmallerPercent Value is the smaller of the A-VAL and the F-VAL and the LargerPercent Value is the larger of the A-VAL and the F-VAL.

Sample:

70 = ((A (63%) × 6) + B (116%)) / 7

Resulting File:

Word          Relatives . . . (sorted)
A             B 70 . . .
B             A 70 . . .
C             . . .
D             . . .
E             . . .
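The RELATIVE value calculation described for RELATIVE.BAS, (Lower × 6 + Higher)/7 with the top 63 kept, can be sketched in Python. This is only an illustration: the function names and dictionary layout are hypothetical. The figure text annotates the divisor with "12 if not mutual" while the prose says the non-mutual value is divided by seven, so the sketch exposes the divisor as a parameter and defaults to the prose's 7.

```python
def relative_value(f_val, a_val, divisor=7.0):
    """(Lower * 6 + Higher) / divisor, weighting the weaker direction six to one."""
    lower, higher = min(f_val, a_val), max(f_val, a_val)
    return (lower * 6.0 + higher) / divisor

def relatives_for(word, fc_lists, top_n=63):
    """Relatives of `word`, highest RELATIVE value first.

    fc_lists: dict mapping word -> {companion: percentage} (hypothetical layout).
    A-VAL is zero when `word` is missing from the companion's own list.
    """
    scored = []
    for f, f_val in fc_lists.get(word, {}).items():
        a_val = fc_lists.get(f, {}).get(word, 0.0)
        scored.append((f, relative_value(f_val, a_val)))
    scored.sort(key=lambda t: (-t[1], t[0]))
    return scored[:top_n]
```

For the sample percentages of 116 and 63 this gives about 70.57 in both directions, which the sample in the text reports as 70. Weighting the smaller percentage six times as heavily means a one-sided association scores low: a companion at 116% with no reverse entry gets only about 16.6.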



After the relatives have been found each of the keywords is given a single polysemantic weight that does not change from document to document by the program POLYSEMY.BAS which uses the following formula:

PolyValue = [formula not recoverable from this copy; the legible terms are Avg3, Avg6, Avg20, Avg63, TotRelVal and DocFreq with an exponent of 1.2]



Here Avgn is the average of the relative percentages 
of the first n relatives of the keyword, TotRelVal is the 
sum of relative percentages over all relative lists that 
the keyword is in, and DocFreq is the number of docu- 
ments that the keyword is in (having a WEIGHT above 
the adaptive threshold). 

Once the indexing programs have been run, the ABSTRACT program is run to create highlights of the full text that will be presented to the user before or in place of the full text itself. First the documents are broken into sentences using a Sentence Ends Algorithm. Then the sentences are assigned weights (values) as a whole and the top ranked sentences are chosen to be part of the highlight. Finally a Sanitize algorithm is used to "X" out (eliminate) proper names from the highlights. See FIG. 12 for specific details on the algorithms used in the ABSTRACT program.

Once the indexing and optionally the ABSTRACT programs have been run, the QSEARCH program can be used to search for documents. This is done by entering a query in natural language. The user program will parse the query to find all the keywords it contains using algorithms similar to those in the AIM program.

After the query is parsed the user is shown the keywords that are contained in the query in order of their polysemantic weight and is given the opportunity to add and delete words in the query and to have the program suggest SWAPS terms based on the query. These SWAPS terms are generated by generating for each keyword in the vocabulary a summed-relpoly-percentage which is the sum, over all terms that are in the query, of the relpoly percentages of that keyword, where the relpoly percentage is the product of the relative percentage and the polysemantic weight. Then the summed relpoly percentages are adjusted using a concept called Soft Boolean Connectors to come up with a final SWAPS value for each keyword. The keywords are then ranked by SWAPS value and the highest ranked are presented to the user as suggested SWAPS terms to be added to the query.
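The summed-relpoly-percentage step can be sketched in Python as below. This is a simplified illustration, not the QSEARCH code. The names are hypothetical, the Soft Boolean adjustment is left out, and three points are assumptions: the polysemantic weight used is that of the candidate keyword, terms already in the query are not suggested again, and the boost factor of 2 is applied to the relative percentages contributed by the most recently chosen SWAPS terms.

```python
def summed_relpoly(query_terms, relatives, poly_value, boosted=(), boost=2.0):
    """Rank candidate SWAPS terms by summed relpoly percentage (no Soft Boolean step).

    query_terms: iterable of keywords in the query.
    relatives:   dict mapping term -> {relative: relative percentage}.
    poly_value:  dict mapping keyword -> polysemantic weight.
    boosted:     query terms whose relative percentages are multiplied by `boost`.
    """
    query = set(query_terms)
    totals = {}
    for q in query:
        factor = boost if q in boosted else 1.0
        for candidate, rel_pct in relatives.get(q, {}).items():
            if candidate in query:
                continue  # assumed: do not suggest a term already in the query
            relpoly = factor * rel_pct * poly_value.get(candidate, 0.0)
            totals[candidate] = totals.get(candidate, 0.0) + relpoly
    return sorted(totals.items(), key=lambda t: (-t[1], t[0]))
```

A candidate that is a relative of several query terms accumulates a contribution from each, which is how a term related to the whole query can outrank a term strongly related to only one query word.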

The Soft Boolean Connectors concept involves penalizing pairs of terms that co-occur often (i.e., in many documents) when calculating the adjustment to be applied to the summed relpoly percentages.

First, multiply the last group of SWAPS words by Boost Factor (=2)

Then add relative values of relatives of main word after each is multiplied by the PolyValue of the Word
(The previous value will be called "Temp Value")

Create table for every pair combination of query words, e.g., for words A, B, & C there are three pairs:
AB
AC
BC

For each pair of query words ("A" & "B"), the Relative Value used in the formula below is B's Relative Value in A's Relative List or, if B doesn't appear in A's Relative List, then the value is taken from A's Relative Value in B's Relative List (this is possible because the Relative Value between any two words is mutual), i.e., if B is found in A's Relative list, take just that value. You don't need to look at B's list to find A's value there because, if it is there, it would have the same value as B has in A's list. Only if B is not in A's Relative list check for A in B's list. Enter the Relative Penalty value resulting from the following formula into the table for each combination (pair):

Relative Penalty = (Relative Value × SQR(avg doc freq of the respective combination (e.g., A & B) / avg doc freq of all words in the database)) / 42

Example:

if Relative values are
AB = 70
AC = 75
BC = 65
and A appears in 5 documents and B appears in 4 and the avg. doc freq is 4, then using the following formula for AB

1.77 = (70 × SQR(4.5 / 4)) / 42

the table value for AB is 1.77
(look below for maximum, making this 1.0 instead)

for a hit of 3 words - 3 relatives of main word or 3 words in a document
ABC

Adjust Value = (# of hits - sum of penalties)

Note: The "# of hits" value is a sum over the n words of a term formed from (PolyValue of word / Avg PolyValue), where n is 3 in this example

Word/Document Value = Temp Value × Adjust Value

MAXIMUM PENALTY TABLE (SWAPS)

query words     Max.
(for each pair)
2               0.3
3               1.0
4 & up          0.9
(for sum of pairs)
2               0.3
3               1.4
4               1.S
5               13
6 & up          2.8

After the user has modified the query by choosing SWAPS terms, he can have the program suggest new SWAPS terms based on the new query. In this case the program boosts the relative percentages of the last chosen set of SWAPS terms before calculating summed relpoly percentages. This allows the user to navigate in the data base by modifying his query so that it will find documents containing the SWAPS terms.

For example, FIG. 2 shows the options the user will be presented with after entering the query "when can a contract be enforced". If the user chooses the menu option "Related Terms" he will be shown a list of SWAPS terms as shown in FIG. 3. This first set of SWAPS terms that are presented to the user includes the term "statutes". The user may choose one or more of these suggested SWAPS terms to add to the query. In FIG. 4 we see that the user has chosen to add the term "statutes" to the query. At this point the user can again ask the system to suggest SWAPS words. This time the previously added SWAPS term "statutes" will be given extra weight in determining which new terms are suggested to the user. In FIG. 5 we see the resulting suggested SWAPS terms generated from the four query terms "agreement", "statutes", "enforcement", and "can", with "statutes" given more weight than the other three terms. Notice that the SWAPS words are ranked somewhat differently than in FIG. 3 and in
particular a new SWAPS term "statute of limitations" is
suggested. By adding the term "statutes" to the query 
and then asking again for suggested SWAPS terms the 
user has "moved" the query to "an area of the database" 
that contains documents dealing with "statute of limita- 
tions", which is a term of art that makes the original 
query more focused and is likely to find documents that 
are relevant to the intent of the original query. Here the 
fact that both terms "statutes" and "statute of limita- 
tions" contain the same word is fortuitous. It is the 
meaning of the term "statutes" which makes it a close 
relative of "statute of limitations" by virtue of the fact
that these two terms co-occur in many of the same 
documents. 

Once the user is satisfied with his query he asks the program to retrieve documents that are relevant to the query. In FIG. 6 he would choose the View Documents option. The system will then use its index files to assign a value to each document and then rank the documents. The documents are ranked by generating for each document a summed-weightpoly-value which is the sum, over all terms that are in the query, of the weightpoly values of that keyword, where the weightpoly value is the product of the weight of the keyword in that document and its polysemantic weight. Then the summed-weightpoly values are adjusted using the Soft Boolean Connectors concept to come up with a final value for each document. The documents are then ranked by value and presented to the user in order of rank.
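The summed-weightpoly-value used for ranking can be sketched in Python. This is an illustration under assumptions, not the RankRecords code: the function name and dictionary layout are hypothetical, the boost factor and the Soft Boolean adjustment are left out, and documents containing no query term are simply omitted from the result.

```python
def rank_documents(query_terms, doc_weights, poly_value):
    """Rank documents by summed weightpoly value, highest first.

    doc_weights: dict mapping doc_id -> {keyword: weight in that document}.
    poly_value:  dict mapping keyword -> polysemantic weight.
    """
    scores = []
    for doc_id, weights in doc_weights.items():
        # weightpoly value = document weight of the keyword * its polysemantic weight
        total = sum(weights[t] * poly_value.get(t, 0.0)
                    for t in set(query_terms) if t in weights)
        if total > 0:
            scores.append((doc_id, total))
    scores.sort(key=lambda t: (-t[1], t[0]))
    return scores
```

Because the polysemantic weight is the same in every document, it acts as a per-term multiplier, while the document weight carries the document-specific evidence.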

The Soft Boolean Connectors concept involves penalizing pairs of terms that co-occur often (i.e., in many documents) when calculating the adjustment to be applied to the summed relpoly percentages. First, multiply original query words by Boost Factor (=2)

Then add WEIGHT values of key words in a document after each is multiplied by the PolyValue of the word
(The previous value will be called "Temp Value")

Create table for every pair combination of query words (ABC)
AB
AC
BC
For each pair of query words ("A" & "B"), the Relative Value used in the formula below is B's Relative Value in A's Relative list or, if B doesn't appear in A's Relative List, then the value is taken from A's Relative Value in B's Relative List (this is possible because the Relative Value between any two words is mutual), i.e., if B is found in A's Relative list, take just that value. You don't need to look at B's list to find A's value there because, if it is there, it would have the same value as B has in A's list. Only if B is not in A's Relative list check for A in B's list. Enter the Relative Penalty value resulting from the following formula into the table for each combination.

Example:

if Relative values are
AB = 70
AC = 75
BC = 65
and A appears in 5 documents and B appears in 4 and the avg doc freq is 4
then using the following formula for AB

2.65 = (70 × SQR(4.5 / 4)) / 28

the table value for AB is 2.65
(look below for maximum, making this 1.0 instead)

for a hit of 3 words - 3 words in a document
ABC

Adjust Value = (# of hits - sum of penalties)

Note: The "# of hits" is the same as above for the SWAPS.

Word/Document Value = Temp Value × Adjust Value
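The pair penalty and the Adjust Value can be illustrated with a Python sketch. It rests on several assumptions and is not the program's code: the function names are hypothetical; "avg doc freq of the respective combination" is taken to be the mean of the two words' document frequencies (which gives the 4.5 implied by the example); the divisors 42 (SWAPS) and 28 (ranking) are read from the figure text and are consistent with the example results 1.77 and 2.65; the maximum table is assumed to cap each pair and then the sum; and the "# of hits" value is supplied by the caller because its formula is not fully legible in this copy.

```python
import math
from itertools import combinations

def relative_penalty(relative_value, doc_freq_a, doc_freq_b, avg_doc_freq_all, divisor):
    """Relative Value * SQR(pair avg doc freq / overall avg doc freq) / divisor."""
    pair_avg = (doc_freq_a + doc_freq_b) / 2.0  # assumed meaning of the pair average
    return relative_value * math.sqrt(pair_avg / avg_doc_freq_all) / divisor

def adjust_value(hit_words, num_hits, relative, doc_freq, avg_doc_freq_all,
                 divisor, max_per_pair, max_sum):
    """Adjust Value = # of hits - sum of (capped) pair penalties.

    relative: dict mapping (word, word) -> mutual Relative Value, either key order.
    max_per_pair, max_sum: caps taken from the maximum penalty table.
    """
    total = 0.0
    for a, b in combinations(sorted(hit_words), 2):
        rel = relative.get((a, b), relative.get((b, a), 0.0))
        penalty = relative_penalty(rel, doc_freq[a], doc_freq[b],
                                   avg_doc_freq_all, divisor)
        total += min(penalty, max_per_pair)
    return num_hits - min(total, max_sum)
```

For the example pair AB (Relative Value 70, document frequencies 5 and 4, overall average 4) the sketch gives about 1.77 with divisor 42 and about 2.65 with divisor 28, both of which the per-pair maximum of 1.0 for three query words then reduces to 1.0.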



MAXIMUM PENALTY TABLE (RANKING)

query words     Max.
(for each pair)
2               0.3
3               U
4               12
5 & up          1.1
(for sum of pairs)
2               as
3               1.6
4               1.9
5               2.3
6 & up          2.8



To facilitate very rapid perusal of the ranked docu- 
ments, the document values (used in the ranking) are 
presented as a bar graph as shown in FIG. 7. Also the 
documents are presented in 3 forms. The first form
consists of a ranked array of the highest ranked terms in 
the document that requires only about § of the display 
screen (FIG. 7). The second form consists of a program 
generated "highlight" of the document which consists 
of very short portions of the document of less than a 
dozen words that contain the highest ranked terms. This 
highlight scrolls in about § of the screen and is shown 
along with the array of highest ranked terms. The third 
form consists of the full text of the document which can 
be scrolled. The user can use arrow keys to move rapidly from one document to the next.

Appendix 1 contains the full BASIC program source 
code that implements the preferred embodiment described above. This code must be compiled using the
Microsoft 7.1 BASIC compiler to produce object modules which must then be linked along with libraries
containing object code for assembler routines from the 
Crescent Software QuickPak Professional Advanced 
Programming Library for BASIC Compilers Version 
4.12 to produce an executable file. 

There has thus been shown and described a novel 
document indexing and retrieval system which fulfills 
all the objects and advantages sought therefor. Many 
changes, modifications, variations and other uses and 
applications of the subject invention will, however, 
become apparent to those skilled in the art after consid- 
ering this specification and the accompanying drawings 
which disclose the preferred embodiments therefor. All 
such changes, modifications, variations and other uses 
and applications which do not depart from the spirit and
scope of the invention are deemed to be covered by the 
invention which is limited only by the claims which 
follow. 
DEFINT A-Z

' AIM: Automatic Indexing Module   Revision: 8.05 VN $

' check U-case word against L-case also (CheckUppercaseWord)
'$INCLUDE: '\user\include\Types.bi'

TYPE CodeType
    Code AS INTEGER
END TYPE

TYPE DocIndexType
    Ndx AS LONG
    Num AS INTEGER
    'Tot AS INTEGER     ' -- Total Number of keys in doc for DJ (shorter docs)
    Tot AS LONG         ' -- Total Number of keys in doc for WEST (longer docs)
    Pad AS STRING * 6   ' -- need to pad it out to 16 bytes for > 12,800 docs
                        '    when using Tot as LONG
END TYPE

TYPE DocKeyType
    Code AS INTEGER
    Freq AS INTEGER
END TYPE

TYPE Str47
    Str AS STRING * 47
END TYPE

TYPE Str18
    Str AS STRING * 18
END TYPE

TYPE NewCombType
    Str AS STRING * 45
    Code AS INTEGER
END TYPE

TYPE NewSingType
    Str AS STRING * 16
    Code AS INTEGER
END TYPE

DECLARE SUB CheckUCaseUorc (Words, PrevUordS, length*, NoNaoeFlagX) 
DECLARE SUB Config (NachireS. first*, Last&) 
DECLARE SUB DispNsg (HsgS, rS, cX) 
DECLARE SUB DispStat CsS) 

DECLARE SUB EasAlloc (NusPagesZ, Handle*, LoadFILES) 

DECLARE SUB FindCoabKey (KordEHSX, NutiUordsft, KyEHSZ, NusKyZ, ConbKeyEHSX, NuaCoabKeyX) 
DECLARE SUB FindSingKey (i*ordEHSZ, NuBVordsft, KeyENSX, HusKey*, SingFoundEMSX, MuaSing?3und&) 
DECLARE SUB pause (TiksT) 

DECLARE SUB ReadEnglishText (FirstLin**, LastLineft, Handle!, LinZ) 
DECLARE SUB ReadGernanText CflrstLineft, LastLlneft, Handle*, LlnX) 
DECLARE SUB Read Sect ion (TxtS, SecArraySO, ArtArraySO) 

DECLARE SUB UindHgr (OLRo-X. ULColX, LRRowX, LRColX, FraeeX, BoxColrZ, TextColrZ, Texts) 

DECLARE SUB UordParse (TextMandLeZ. Lines*. wordHandleX, itordsi) 

DECLARE FUNCTION First Las t= (words. First*. LastZ. KeyTypeX) 

DECLARE FUNCTION LoadlntoESS* (F1leS) 

DECLARE FUNCTION NUOtS Cxi) 

DECLARE FUNCTION XLateX CxS) 

«■ . External routines 

' SINCLUOE:_L\USER\INCLUDE - DSCLARES.BI1 ' 



• PROGRAM STABT 



CONST Sing » 6, Comb » 1 

CONNON SHARED NomAttr, RevAttr, Filei, SecCode, Art Code 
CONNON SHAREO DocDirS, LstDirf, AtListS. LangS 

CONNON SHARED XLateTebleZC), SingTableZO, CoebTableXO. NiBSO, LENNLEO 
COWON SHARED ThirtyTwo, Sixty Four, SlxteenX, Thirty TwoXS 
CONNON SHAPED ASCEND, DESCEND, FALSE, TRUE 

CONNON SHARED ENTER, ESCAPE, ASCa, ASCZ, ASCupperA. ASCupperZ, ASCslash. ASCO, ASC9. ASCat 
CONKON SHARED CodeTeap AS Code Type, LENCode, LenCocb, LenSing 
CONNON SHARED Rua£eM>1 Keyword . NumComt»2lCeyvord , NuaCoob3Keyword 
CONNON SHARED ConbllCeyvordEHSX, Co<nb2KeywordEftSX, Coab3lCeyword£flSS 
COftfttN SHARED NuaSlngl Keyword, NuBSing2Keyword 

CONNON SHARED SinglKeyworcENSX, Sing2Keyw©rdB1SX, PrefixesSO, NeanPref ixesSO 
CONNON SHARED KuabersSO. SectlonSO, ArticleSO, ParagrephSO, ArtikelSO 
CONNON SHARED Not*ase£ndingS(). NajneEndingSO 
NLGData; 

DATA -or%-B^","aessrs-.*sen\''rep», B os% B dr-,-drs- 
EnglishData: 

-DATA -Section-,»Sec».-Sec.- 
DATA>^Art icle" , "Art . " # "Ar^" 

GernanOatai 

DATA "Paragraph*. -Par". "Par.-, -Para. - 
DATA •ArtikelVArt","Art." 

Nu&Data: * 
DATA •I","II ,, ,"III - ,*IV ,, ,-V-,*VI - ,*VII". - VIII-,-IX".«X - 

data -xr,Txii- f -xiir,-x:v-.-xv. •xvi-,-»ir,»mir,"xix-,-xx- 

data ■xxi-,°mi B /xxii:-.-xxtv«,»xxv-,-mi\^own\wiii%'nu-,-m- 
NameEnd:

DATA B \no\".^er\i^\a»\ey\t2\","\oan\soo\ong\haa\ton\scn\ 0 /\>Bann\tein\" 
NoNaaeEnd: 

DATA "NaU-.-ViQoValsXnv'^^-XBentXionaNencyKancyXnasaS' 1 
DATA •\Bents\ional\n:e»\encie5\enciw\ ta , ,, \aental\" 
*$ INCLUDE: ^user\inc luce -.pref pat. bi 1 



Config ItachincS, FirstDecl, LastDoca 

RevS - "SRevision: 5.12 VN S" 

Rev* « rtIDS<Rev$, 14, 7) 

o12! » 0 

a14« a 0 

as 

WindMgr 2, 2, 24, 79, 2, VjraAttr, fievAttr, "Autonatic Indexing Nodule (Rev " ♦ RevS - ">" 

• : : ; Read in Coabinsd Kays 

IF LengS « "GERMAN" THEN 
LenCoab » 66 '62-2 

ELSE 

LenCoab = 47 '65*2 

END IF 



'it had to divide coabke/.str to 3 files because array can't be > 128k 
•and there is no oenory to make elenent size power of 2 

leadFILES = LstOIrS ♦ "CD-3KEY1.STR- 

IF NOT ExistXCloadFILES) THEM OS : PRINT LoadFILES; " not found.*: END 
NumCoablKeyword = FUeSizeaCLoadPILES) \ LenComb 

Dispflsg "Loading " ♦ NuaSlNuaCcabl Keyword) ♦ " Coabined Keywords", r, c 
CoablKeywordENSX = LoadlrioEnsXCLoadFILEJ) 
Oispnsg r, c 

LoadFILES = LstDirS ♦ "C0-3XEY2.STR" 

IF NOT ExistX(LoedFILES) THEN CLS : PR HIT LoadFILES; " not found.": END 
rtuaConb2Keywoni - FileSizHtLoadFILES) \ LenCoab 

Disphsg "Loading " ♦ Nuas:!tusCoab2Keyvord) ♦ " Coabined Keywords", r, c 
Coab2KeywordEnSX - LoadlntoENSttLoadFILES) 
Disphsg "", r, e 

LoadFILES « LstDirS * "C0-SKEY3.STR" 

IF NOT ExistXtLoadFILES) THEN CLS : PRINT LoadFILES; ■ not found.": END 
KunCocb3Keyword = FileSi*ei< LoadFILES) \ LenConb 

Disphsg "Loading " ♦ Muffli:«uaCcob3Keyword) ♦ " Coabined Keywords", r, c 
Coab3KeywordEHSX = Lcadlr.toERSttLoadFILES) 
DispHsg "", r, e 

' Read Single Keys

LoadFILE$ = LstDir$ + "SINGKEY.STR"
IF NOT Exist%(LoadFILE$) THEN
    CLS
    PRINT "AIM Fatal Error:"
    PRINT
    PRINT LoadFILE$; " not found."
    PRINT
    chime 10
    PRINT "Press the SPACE BAR to exit:"
    i$ = INPUT$(1)
    END
END IF

DIM SingKeyTemp AS STR32
NumKeyword = FileSize&(LoadFILE$) \ ThirtyTwo
DispMsg "Loading " + Num$(NumKeyword) + " Single Keywords", r, c
SingKeywordEMS% = LoadIntoEMS%(LoadFILE$)
DispMsg "", r, c

' Read 3-Char Tables

Symb = 28: First = 1: Last = 2
REDIM XLateTable%(38 TO 122)
REDIM SingTable%(1 TO Symb, 1 TO Symb, 1 TO Symb, 1 TO 2)
REDIM CombTable%(1 TO Symb, 1 TO Symb, 1 TO Symb, 1 TO 2)

XLateTable%(47) = 1 ' / char, as used in non-wildcard words
XLateTable%(38) = 2 ' & char, as used in S&P, A&P, etc.
FOR i = ASCa TO ASCz
    XLateTable%(i) = i - 94 ' so that a=3, b=4, ..., z=28
NEXT

FGetAH LstDir$ + "KEYWORD.TBL", SEG SingTable%(1, 1, 1, 1), (4 * 28), (28 * 28)
FGetAH LstDir$ + "KEYCOMB.TBL", SEG CombTable%(1, 1, 1, 1), (4 * 28), (28 * 28)

' Open Documents

' -- actual count for total number of documents
Count& = FileSize&(DocDir$ + ".NDX") \ 8

OPEN DocDir$ + ".NDX" FOR RANDOM ACCESS READ SHARED AS #8 LEN = 8
OPEN DocDir$ + ".TXT" FOR RANDOM ACCESS READ SHARED AS #9 LEN = 80

'DocKeys hold ALL of the Keywords and Counts for all documents
'DocNdx(x).NDX points at the element in DocKeys() where the Keys for
'Document #x are, .NUM is how many keys are in the document,
'and .TOT is the total sum of all occurrences of keys
'store ems for dockey and docndx - no need anymore because we write to disk each doc.

'4096 elements per 16k EMS page (4 bytes per element)
'NumPages = 600 ' MaxDocKeys& \
'MaxDocKeys& = CLNG(NumPages) * SixteenK \ 4
'QPrintRC STR$(NumPages) + " EMS pgs for DocKeys.", 4, 2, -1
'QPrintRC "MaxDocKeys: " + STR$(MaxDocKeys&), 4, 59, -1
'EmsAllocMem NumPages, DocKeysEMS%
'IF EmsError% THEN PRINT "Couldn't allocate "; CLNG(NumPages) * SixteenK; "bytes of EMS for DocKeysEMS.": Chime 2: STOP
'EMSNdxPages = 20 '360k can hold 36000 doc
'EmsAllocMem EMSNdxPages, DocNdxEMS%
'IF EmsError% THEN PRINT "Couldn't allocate "; CLNG(EMSNdxPages) * SixteenK; "bytes of EMS for DocNdxEMS.": Chime 2: STOP

CurrDocKey& = 1 ' current pointer into DocKeys()

Debug = FALSE
Start& = FirstDoc&

' -- only use the LastDoc& parameter if there are at least
'    that many documents in the database, otherwise use the
'    actual number of documents
IF Count& > LastDoc& THEN Count& = LastDoc&

' -- create the DOCKEYS.AM file for output
FCreate LstDir$ + "DOCKEYS.AM" + Machine$
FOpen LstDir$ + "DOCKEYS.AM" + Machine$, DocKeyFile%
IF DocKeyFile% = -1 THEN
    PRINT "Can't create DOCKEYS.AM"
    chime 2
    END
END IF

FCreate LstDir$ + "DOCINDEX.AM" + Machine$
FOpen LstDir$ + "DOCINDEX.AM" + Machine$, DocNdxFile%
IF DocNdxFile% = -1 THEN
    PRINT "Can't create DOCINDEX.AM"
    chime 2
    END
END IF

'AMSavePoint& = 1 ' intermediate save point
'NdxPoint = Start&

DIM DocNdx AS DocIndexType
LenDocNdx = LEN(DocNdx)

DIM DictTemp AS DictType 'only for getting size, depending only on DictType
DictWordNum = FileSize&(LstDir$ + "DICT.WRD") \ LEN(DictTemp)

REDIM Idf&(1 TO DictWordNum) ' total number of word-codes; idf shows number of docs where this word is
IF NOT Debug THEN
    ' -- open a new file for statistics
    OPEN LstDir$ + "STATS" + Machine$ + ".TXT" FOR APPEND SHARED AS #5
    PRINT #5, " "
    PRINT #5, "Start Time:"; TIME$; TStart!
    PRINT #5, ""
ELSE
    OPEN "\DEV\NUL" FOR OUTPUT AS #5
END IF

Proc = 0 ' number of files actually processed
TStart! = TIMER

MaxTempKey = 1000 'number of unique keys in one doc. if more then redim preserve

DIM Doc AS ISAMtype
CodeString$ = ""
FreeStrSp& = FRE("")



'***** START
FOR File& = Start& TO Count&

    REDIM DocKeyTemp(1 TO MaxTempKey) AS DocKeyType
    LENDocKeyTemp = LEN(DocKeyTemp(1))

    PRINT #5, "File #"; MID$(STR$(File&), 2)

    GET #8, File&, Doc

    NumLines = Doc.Last - Doc.First + 1 ' lines of text in the file

    ' Allocate EMS to hold the Text file
    NumPages = 80& * NumLines / SixteenK + 1 ' 80 bytes per line, 16k per EMS page
    CALL EmsAllocMem(NumPages, TextEMS%)
    IF EmsError% THEN PRINT "Couldn't allocate"; NumPages * SixteenK; "bytes of EMS.": STOP

    ' Allocate EMS for the parsed word array (32 bytes per word)
    NumPages = CLNG(NumLines) * 12 \ 512 + 1 ' max number of words (512 per 16K EMS page)
    CALL EmsAllocMem(NumPages, WordEMS%)
    IF EmsError% THEN PRINT "Couldn't allocate"; NumPages * SixteenK; "bytes of EMS.": STOP

    Row = 4
    CALL WindMgr(Row, 30, Row + 1, 50, 2, RevAttr, RevAttr, "CURRENT DOCUMENT")
    QPrintRC MID$(STR$(File&), 2) + "/" + MID$(STR$(Count&), 2), Row + 1, 35, -1



    ' Read Document into EMS eliminating blank lines
    ' strip non-alpha chars and crunch extra spaces

    IF Lang$ = "ENGLISH" THEN
        ReadEnglishText Doc.First, Doc.Last, TextEMS%, NumLines%
    ELSEIF Lang$ = "GERMAN" THEN
        ReadGermanText Doc.First, Doc.Last, TextEMS%, NumLines%
    END IF

    ' Parse Text into Words
    ' NumWords& = number of words parsed
    WordParse TextEMS%, NumLines, WordEMS%, NumWords&

    EmsRelMem TextEMS% ' release handle of DOC file, not needed anymore



    ' Find Combined Keys

    ' 50% ratio for expected combined keys
    ' 8192 keys per page (16384 page size / 2 keycode size)
    CombRatio! = .5

    DispStat MID$(STR$(CLNG(NumWords& * CombRatio!)), 2) + " Combined Keys allocated."

    NumPages = NumWords& * CombRatio! \ 8192 + 1
    EmsAlloc NumPages, CombCodeEMS%, "CombCodeEMS"
    a11! = TIMER

    FindCombKey WordEMS%, NumWords&, CombKeywordEMS%, NumCombKeyword, CombCodeEMS%, NumCombCode
    a12! = a12! + TIMER - a11!

    DispStat Num$(NumCombCode) + " combined keys found."

    ' Store the Combined Keywords in DocKeysEMS

    DocNdx.Ndx = CurrDocKey&
    Index& = CurrDocKey&

    'If this is a first appearance then put the code into DocKeyTemp, i.e. store it,
    'increase the number of documents containing this word (IDF) and extend the string
    '(we keep previous codes as a string (\**\**\**...)).
    'If not, add to the frequency of this word in this particular document.

    k = 0
    FOR i = 1 TO NumCombCode
        EmsGet1El CodeTemp, LENCode, i, CombCodeEMS%
        CodeStr$ = "\" + MKI$(CodeTemp.Code)
        CodePos = INSTR(CodeString$, CodeStr$)
        IF CodePos = 0 THEN
            k = k + 1
            Idf&(CodeTemp.Code) = Idf&(CodeTemp.Code) + 1
            DocKeyTemp(k).Code = CodeTemp.Code
            DocKeyTemp(k).Freq = 1
            ' since it wasn't found, it's new so we'll put it at CurrDocKey&
            ' and increment CurrDocKey& to point to the next available slot
            CurrDocKey& = CurrDocKey& + 1
            CodeString$ = CodeString$ + CodeStr$
        ELSE
            CodePos = CodePos \ 3 + 1
            DocKeyTemp(CodePos).Freq = DocKeyTemp(CodePos).Freq + 1
        END IF
    NEXT

    EmsRelMem CombCodeEMS%

    'QPrintRC STR$(CurrDocKey& - 1), 3, 70, -1

    ' Find Single Keywords

    SingRatio! = 1.3 'we need more space for word with and without prefixes
    DispStat LTRIM$(STR$(CLNG(NumWords& * SingRatio!))) + " Single Keys allocated."
    NumPages = NumWords& * SingRatio! \ 8192 + 1 ' -- 8192 keycodes per 16k EMS page
    EmsAlloc NumPages, SingCodeEMS%, "SingleCodeEMS"
    a13! = TIMER

    FindSingKey WordEMS%, NumWords&, SingKeywordEMS%, NumKeyword, SingCodeEMS%, NumSingCode&
    a14! = a14! + TIMER - a13!

    DispStat LTRIM$(STR$(NumSingCode&)) + " single keys found."

    ' don't need the parsed word list anymore so release the EMS
    EmsRelMem WordEMS%

    'the same as for combined keywords:
    'if this is a first appearance then put the code into DocKeyTemp, i.e. store it,
    'increase the number of documents containing this word (IDF) and extend the string
    '(we keep previous codes as a string (\**\**\**...)).
    'If not, add to the frequency of this word in this particular document.

    k& = LEN(CodeString$) \ 3 'number of unique keywords
    FOR i& = 1 TO NumSingCode&
        EmsGet CodeTemp, LENCode, i&, SingCodeEMS%
        CodeStr$ = "\" + MKI$(CodeTemp.Code)
        CodePos = INSTR(CodeString$, CodeStr$)
        IF CodePos = 0 THEN
            k& = k& + 1
            IF k& > MaxTempKey THEN REDIM PRESERVE DocKeyTemp(1 TO k&) AS DocKeyType
            Idf&(CodeTemp.Code) = Idf&(CodeTemp.Code) + 1
            DocKeyTemp(k&).Code = CodeTemp.Code
            DocKeyTemp(k&).Freq = 1
            ' since it wasn't found, it's new so we'll put it at CurrDocKey&
            ' and increment CurrDocKey& to point to the next available slot
            CurrDocKey& = CurrDocKey& + 1
            CodeString$ = CodeString$ + CodeStr$
        ELSE
            CodePos = CodePos \ 3 + 1
            DocKeyTemp(CodePos).Freq = DocKeyTemp(CodePos).Freq + 1
        END IF
    NEXT

    'QPrintRC STR$(CurrDocKey& - 1), 3, 70, -1

    DocNdx.Num = CurrDocKey& - Index&
    DocNdx.Tot = NumSingCode& + NumCombCode

    DispStat MID$(STR$(NumSingCode& + NumCombCode), 2) + " Keywords."
    DispStat MID$(STR$(CurrDocKey& - Index&), 2) + " unique Keywords."
    'release memory for the combined and single keywords in EMS
    EmsRelMem SingCodeEMS%

    'store DocKeyTemp array on disk; there is not enough memory to keep it in RAM
    'and there is no sense in putting it in EMS and later reading it back to RAM & putting it on disk
    k& = LEN(CodeString$) \ 3
    FOR i& = 1 TO k&
        CALL FPutRT(DocKeyFile%, DocKeyTemp(i&), Index& + i& - 1, LENDocKeyTemp)
    NEXT

    CALL FPutRT(DocNdxFile%, DocNdx, CLNG(File&), LenDocNdx) 'write ndx for each document

    IF INKEY$ = CHR$(27) THEN
        chime 10
        DO: LOOP UNTIL INKEY$ = ""
        x$ = INPUT$(1)
        IF x$ = CHR$(13) THEN Proc = File&: File& = Count&
    END IF

    CodeString$ = ""
    FreeStrSp& = FRE("")

    ' -- do an intermediate save every 100 docs
    IF (File& MOD 100) = 0 THEN
        DispMsg "Intermediate save at" + STR$(File&), r, c

        ' ' -- dump out the portion of the DocNdx() used (i.e., from Start to File)
        ' FOR i& = NdxPoint TO File&
        '     CALL EmsGet(SEG DocNdx, LenDocNdx, i&, DocNdxEMS%)
        '     CALL FPutRT(DocNdxFile%, DocNdx, i&, LenDocNdx)
        ' NEXT
        ' NdxPoint = File&
        ' FPutAH LstDir$ + "DOCINDEX.AM" + Machine$, SEG DocNdx(Start&), LEN(DocNdx(Start&)), File& - Start& + 1

        ' ' -- save DocKeys since the last save point
        ' FOR i& = AMSavePoint& TO CurrDocKey& - 1
        '     CALL EmsGet(SEG DocKeyTemp, LENDocKeyTemp, i&, DocKeysEMS%)
        '     CALL FPutRT(DocKeyFile%, DocKeyTemp, i&, LENDocKeyTemp)
        ' NEXT
        ' ' -- set the save point to the next dockey
        ' AMSavePoint& = CurrDocKey&

        ' -- save the IDF.SIM
        CALL FPutAH(LstDir$ + "IDF" + Machine$ + ".SIM", SEG Idf&(1), -2, UBOUND(Idf&))
        DispMsg "", 0, 0
    END IF

    DispStat ""

NEXT File&

CLOSE #8, #9

IF Proc <> 0 THEN Count& = Proc

TEnd! = TIMER

PRINT #5,
PRINT #5, "End Time:"; TIME$; TEnd!; (TEnd! - TStart!) / Count&; "seconds per document."
CLOSE #5

TotalTime! = TEnd! - TStart!
IF TotalTime! < 0 THEN TotalTime! = TotalTime! + 86400!
PRINT TotalTime! / 60; "minutes elapsed time."
PRINT "FindCombKey"; a12!
PRINT "FindSingKey"; a14!

• 

'release memory for COMBKEY and SINGKEY lists and the DICT.COD list

IF SingKeywordEMS% THEN EmsRelMem SingKeywordEMS%
IF Comb1KeywordEMS% THEN EmsRelMem Comb1KeywordEMS%
IF Comb2KeywordEMS% THEN EmsRelMem Comb2KeywordEMS%
IF Comb3KeywordEMS% THEN EmsRelMem Comb3KeywordEMS%

IF Sing1KeywordEMS% THEN EmsRelMem Sing1KeywordEMS%
'IF Sing2KeywordEMS% THEN EmsRelMem Sing2KeywordEMS%

'decrement current pointer so that it points to the end of the array
'not at the next available space
CurrDocKey& = CurrDocKey& - 1

' Save storage to disk

IF NOT Debug THEN

    DispMsg "Saving IDF" + Machine$ + ".SIM", r, c
    FPutAH LstDir$ + "Idf" + Machine$ + ".SIM", SEG Idf&(1), -2, UBOUND(Idf&)
    DispMsg "", 0, 0

    DispMsg "Saving DocIndex.AM" + Machine$, r, c
    FOR i& = NdxPoint TO File&
        CALL EmsGet(SEG DocNdx, LenDocNdx, i&, DocNdxEMS%)
        CALL FPutRT(DocNdxFile%, DocNdx, i&, LenDocNdx)
    NEXT

    FPutAH LstDir$ + "DOCINDEX.AM" + Machine$, SEG DocNdx(Start&), LEN(DocNdx(Start&)), Count& - Start& + 1
    DispMsg "", 0, 0

    DispMsg "Saving DocKeys.AM" + Machine$ + " from EMS", r, c
    FOR i& = AMSavePoint& TO CurrDocKey&
        EmsGet SEG DocKeyTemp, LENDocKeyTemp, i&, DocKeysEMS%
        FPutRT DocKeyFile%, DocKeyTemp, LENDocKeyTemp
    NEXT

    FClose DocKeyFile%
    FClose DocNdxFile%

END IF

'EmsRelMem DocKeysEMS% ' release memory for DocKeys

DispMsg "", 0, 0
chime 10
'$Page

END
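
The main loop above avoids a second lookup structure by keeping every keyword code already seen in the current document in one string, three bytes per entry (a backslash plus the two-byte `MKI$` value), so that a substring position divided by three gives the slot in `DocKeyTemp`. A minimal Python sketch of that bookkeeping follows; the function name `count_codes` and the list-based return value are illustrative assumptions, not part of the listing.

```python
import struct

def count_codes(codes):
    """Count keyword codes for one document with the 3-byte key-string trick.

    Returns (unique codes in first-seen order, parallel list of frequencies).
    """
    code_string = b""          # plays the role of CodeString$
    keys, freqs = [], []       # play the role of DocKeyTemp().Code / .Freq
    for code in codes:
        # Same layout as "\" + MKI$(code): separator + 16-bit little-endian value.
        entry = b"\\" + struct.pack("<h", code)
        pos = code_string.find(entry)
        if pos < 0:
            keys.append(code)          # first appearance in this document
            freqs.append(1)
            code_string += entry
        else:
            # Entries are 3 bytes wide, so position // 3 is the slot index.
            # Like the original INSTR search, this is a plain substring search
            # and is not forced to start on an entry boundary.
            freqs[pos // 3] += 1
    return keys, freqs
```

For example, `count_codes([7, 300, 7, 7, 300, 12])` yields the unique codes `[7, 300, 12]` with frequencies `[3, 2, 1]`.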

SUB CheckUCaseWord (Word$, PrevWord$, Length, NoNameFlag) STATIC

    NoNameFlag = FALSE

    FOR i = 1 TO 4
        IF INSTR(NameEnding$(i), RIGHT$(Word$, i)) THEN
            IF Length > 9 THEN
                NoNameFlag = TRUE
                EXIT SUB
            END IF
        END IF
    NEXT

    IF Length > 7 THEN
        NoNameFlag = TRUE
        EXIT SUB
    END IF

    IF INSTR("\a\an\any\few\this\such\many\several\", "\" + PrevWord$ + "\") THEN
        IF Length > 3 THEN
            NoNameFlag = TRUE
            EXIT SUB
        END IF
    END IF

    IF LCASE$(PrevWord$) = "the" AND Length > 4 THEN
        NoNameFlag = TRUE
        EXIT SUB
    END IF

    FOR i = 2 TO 6
        IF INSTR(NoNameEnding$(i), RIGHT$(Word$, i)) THEN
            IF Length > 2 THEN
                NoNameFlag = TRUE
                EXIT SUB
            END IF
        END IF
    NEXT

END SUB


SUB Config (Machine$, First&, Last&) STATIC

    IF NOT EmsLoaded% THEN
        CLS
        PRINT "AIM Fatal Error:"
        PRINT
        PRINT "No EMS driver was found."
        PRINT
        chime 10
        PRINT "Press the SPACE BAR to exit:"
        i$ = INPUT$(1)
        END
    END IF

    LENCode = LEN(CodeTemp)
    SixteenK = 16 * 1024
    SixtyFour = 64
    ThirtyTwo = 32
    ThirtyTwoK& = ThirtyTwo * 1024&
    ASCEND = 0
    DESCEND = NOT ASCEND
    FALSE = 0
    TRUE = NOT FALSE
    ENTER = 13
    ESCAPE = 27
    ASCa = ASC("a")
    ASCz = ASC("z")
    ASCupperA = ASC("A")
    ASCupperZ = ASC("Z")
    ASCslash = ASC("/")
    ASC0 = ASC("0")
    ASC9 = ASC("9")
    ASCat = ASC("@")

    Cmd$ = QPTrim$(COMMAND$)

    Parms = InCount(Cmd$, " ") + 1 ' number of parameters

    IF Parms = 4 THEN

        ' Expected information on command line:
        ' Config File, Machine #, First Doc, Last Doc

        Extract Cmd$, " ", 1, Strt, SLen ' extract first parm
        ConfigFile$ = MID$(Cmd$, Strt, SLen) + ".CFG"

        Extract Cmd$, " ", 2, Strt, SLen ' extract second parm
        Machine$ = MID$(Cmd$, Strt, SLen)

        Extract Cmd$, " ", 3, Strt, SLen ' extract third parm
        First& = VAL(MID$(Cmd$, Strt, SLen))

        Extract Cmd$, " ", 4, Strt, SLen ' extract fourth parm
        Last& = VAL(MID$(Cmd$, Strt, SLen))

    ELSE

        PRINT
        PRINT "AIM Program Error: Missing Parameters"
        PRINT
        PRINT
        PRINT "Required Parameters are:"
        PRINT
        PRINT "AIM ConfigFile MachineNumber FirstDoc LastDoc"
        PRINT
        chime 10
        PRINT "Press the SPACE BAR to exit:"
        i$ = INPUT$(1)
        END

    END IF

    IF NOT Exist(ConfigFile$) THEN
        chime 10
        PRINT "File "; ConfigFile$; " was not found."
        PRINT "Press any key to return to the system."
        DO: ch$ = INKEY$: LOOP UNTIL LEN(ch$) <> 0
        END
    END IF

    OPEN ConfigFile$ FOR INPUT ACCESS READ SHARED AS #1
    INPUT #1, Fg, Bg, Brdr, LstDir$, DocDir$, NdxDir$, AbstrDir$, Lang$
    CLOSE #1

    COLOR Fg, Bg, Brdr

    NormAttr = OneColor%(Fg, Bg)
    RevAttr = OneColor%(Bg, Fg AND 7)

    File$ = LstDir$ + "@" + Lang$ + ".LST"
    IF NOT Exist(File$) THEN
        chime 10
        PRINT "File "; File$; " was not found."
        PRINT "Press any key to return to the system."
        DO: ch$ = INKEY$: LOOP UNTIL LEN(ch$) <> 0
        END
    END IF

    OPEN File$ FOR INPUT ACCESS READ SHARED AS #1
    INPUT #1, AtList$
    CLOSE #1

    File$ = LstDir$ + Lang$ + ".SEC"
    IF NOT Exist(File$) THEN
        chime 10
        PRINT "File "; File$; " was not found."
        PRINT "Press any key to return to the system."
        DO: ch$ = INKEY$: LOOP UNTIL LEN(ch$) <> 0
        END
    END IF

    OPEN File$ FOR INPUT ACCESS READ SHARED AS #1
    INPUT #1, SecCode, ArtCode
    CLOSE #1

    RESTORE NLGData
    REDIM NLG$(1 TO 8), LENNLG(1 TO 8)
    FOR i = 1 TO 8
        READ NLG$(i)
        LENNLG(i) = LEN(NLG$(i))
    NEXT

    RESTORE EnglishData
    REDIM Section$(1 TO 3)
    FOR i = 1 TO 3
        READ Section$(i)
    NEXT

    REDIM Article$(1 TO 3)
    FOR i = 1 TO 3
        READ Article$(i)
    NEXT

    RESTORE GermanData
    REDIM Paragraph$(1 TO 4)
    FOR i = 1 TO 4
        READ Paragraph$(i)
    NEXT

    REDIM Artikel$(1 TO 3)
    FOR i = 1 TO 3
        READ Artikel$(i)
    NEXT

    RESTORE NumData
    REDIM Numbers$(1 TO 30)
    FOR i = 1 TO 30
        READ Numbers$(i)
    NEXT

    IF Lang$ <> "GERMAN" THEN
        REDIM NameEnding$(1 TO 4)
        REDIM NoNameEnding$(2 TO 6)
        RESTORE NameEnd
        FOR i = 1 TO 4
            READ NameEnding$(i)
        NEXT
        RESTORE NoNameEnd
        FOR i = 2 TO 6
            READ NoNameEnding$(i)
        NEXT
    END IF

    IF Lang$ = "GERMAN" THEN
        RESTORE GermanPrefixes
    ELSE
        RESTORE EnglishPrefixes
    END IF

    REDIM Prefixes$(2 TO 9)
    IF Lang$ = "GERMAN" THEN
        FOR i = 2 TO 9
            READ FirstHalf$, SecondHalf$
            Prefixes$(i) = FirstHalf$ + SecondHalf$
        NEXT
    ELSE
        FOR i = 2 TO 9
            READ Prefixes$(i)
        NEXT
    END IF

    REDIM MeanPrefixes$(3 TO 14)
    IF Lang$ = "GERMAN" THEN
        FOR i = 3 TO 14
            READ FirstHalf$, SecondHalf$, ThirdHalf$
            MeanPrefixes$(i) = FirstHalf$ + SecondHalf$ + ThirdHalf$
        NEXT
    ELSE
        FOR i = 3 TO 14
            READ MeanPrefixes$(i)
        NEXT
    END IF
END SUB



SUB DispMsg (Msg$, r, c) STATIC

    STATIC WindOpen, Scr%() ' is there already a message displayed?
    SHARED Fg, Bg

    IF Msg$ = "" THEN
        IF WindOpen THEN GOSUB MsgClose
        EXIT SUB
    END IF

    IF WindOpen THEN
        CALL chime(9)
        OPEN "DEBUG" FOR OUTPUT AS 10
        PRINT #10, "WindOpen="; WindOpen; HEX$(WindOpen)
        PRINT #10, "Msg$=|"; Msg$; "|"
        PRINT #10, "TRUE="; TRUE; HEX$(TRUE)
        PRINT #10, "FALSE="; FALSE; HEX$(FALSE)
        CLOSE 10
        CALL chime(8)
        CLS
        END
        i$ = INPUT$(1)
        GOSUB MsgClose
    END IF

    Wid = LEN(Msg$)
    IF Wid > 50 THEN Wid = 50
    Msg$ = Msg$ + " " ' make sure there's a space to find at the end (see below)

    MaxLin = LEN(Msg$) \ Wid + 3
    IF MaxLin > 23 THEN MaxLin = 23
    REDIM Text$(MaxLin)
    Lin = 0

    DO
        Lin = Lin + 1 ' increment current line # (also element in text display array)
        LastSpc = QInstrB(Wid + 1, Msg$, " ") ' look for the last space so we can word wrap
        Text$(Lin) = LEFT$(Msg$, LastSpc - 1)
        Msg$ = MID$(Msg$, LastSpc + 1) ' remove portion of string that's in Text$
    LOOP WHILE LEN(Msg$) > Wid

    Msg$ = RTRIM$(Msg$)
    IF LEN(Msg$) THEN
        Lin = Lin + 1
        Text$(Lin) = Msg$
    END IF

    vertmargin = (25 - Lin) / 2
    IF r <> 0 AND c = 0 THEN
        ULr = r
    ELSE
        ULr = 9 'ULr = INT(vertmargin - .3)
    END IF
    DULr = ULr - 1
    LRr = ULr + Lin + 1 ' LRr = 25 - INT(vertmargin)
    DLRr = LRr + 2

    horizmargin = (80 - Wid) \ 2
    ULc = horizmargin
    DULc = ULc - 3
    LRc = 80 - INT(horizmargin)
    IF Wid / 2 = Wid \ 2 THEN LRc = LRc + 1
    DLRc = LRc + 1

    REDIM Scr%(ArraySize%(DULr, DULc, DLRr, DLRc))
    CALL ScrnSave0(DULr, DULc, DLRr, DLRc, SEG Scr%(0))
    CALL WindMgr(ULr, ULc, LRr, LRc, 4, NormAttr, RevAttr, "Status")

    FOR i = 1 TO Lin
        CALL QPrintRC(Text$(i), ULr + i, ULc + 1, -1)
    NEXT

    r = ULr + Lin
    c = ULc + 1 + LEN(Text$(Lin))
    IF LEN(Text$(Lin)) + 2 = Wid THEN c = ULc + 1: r = r + 1

    ERASE Text$
    WindOpen = TRUE
    EXIT SUB

' -- Close Window
MsgClose:
    CALL ScrnRest0(DULr, DULc, DLRr, DLRc, SEG Scr%(0))
    ERASE Scr%
    WindOpen = FALSE
    RETURN

END SUB
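
`DispMsg` wraps its message by repeatedly searching backward for the last space within the first `Wid + 1` characters, cutting there, and carrying the rest forward. The Python sketch below mirrors that loop under stated assumptions: the name `wrap_message` is hypothetical, and the hard break used when no space is found is an addition, since the listing does not show how that case is handled.

```python
def wrap_message(msg, max_width=50):
    """Word-wrap msg the way the DispMsg loop appears to, returning a list of lines."""
    if not msg:
        return []
    width = min(len(msg), max_width)
    msg += " "                       # guarantee a space to find at the end
    lines = []
    while True:
        # Last space among the first width + 1 characters (backward search).
        cut = msg.rfind(" ", 0, width + 1)
        if cut > 0:
            lines.append(msg[:cut])
            msg = msg[cut + 1:]      # drop the space that was used as the break
        else:
            # Assumption: no usable space, so break the word at the width.
            lines.append(msg[:width])
            msg = msg[width:]
        if len(msg) <= width:
            break
    rest = msg.rstrip()
    if rest:
        lines.append(rest)
    return lines
```

With a width of 12, `wrap_message("Loading 1200 Combined Keywords", 12)` gives `["Loading 1200", "Combined", "Keywords"]`.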

SUB DispStat (s$) STATIC

    STATIC Row

    IF Row = 0 THEN
        Row = 15
        Col = 23
        Height = 8
        Wid = 34
        CALL WindMgr(Row, Col - 1, Row + Height, Col + Wid, 1, NormAttr, RevAttr, "FILE STATISTICS")
        Col = Col + 2
    ELSE
        Row = Row + 1
    END IF

    IF s$ = "" THEN
        Row = 0
    ELSE
        CALL QPrintRC(s$, Row, Col, -1)
    END IF

    IF INSTR(s$, "alloc") = 0 THEN PRINT #5, s$
END SUB

SUB EmsAlloc (NumPages%, Handle%, LoadFILE$) STATIC

    EmsAllocMem NumPages%, Handle%
    IF EmsError% THEN
        PRINT "Couldn't allocate"; CLNG(NumPages) * SixteenK; "bytes of EMS for "; LoadFILE$
        Chime 2
        DO: LOOP UNTIL LEN(INKEY$) = 0
        i$ = INPUT$(1)
        END
    END IF
END SUB

SUB FindCombKey (WordEMS%, NumWords&, KeyEMS%, NumKey%, CombFoundEMS%, NumCombFound%) STATIC

'EMS ARRAY    LEN  DESCRIPTION              DIRECTION   MODIFIED?
'WordH:       32   Document words           (Passed)    (Unchanged)
'KeyH:        64   Combined Keywords        (Passed)    (Unchanged)
'CombFoundH:  64   Combined Keywords Found  (Returned)  (Changed)

    DIM WordTemp AS STR32 ' -- current document word
    DIM WordCompare AS STR32 ' -- current word in CombKey to check
    DIM KeyTemp AS CombKeyType ' -- entire Combined Keyword

    IF Lang$ = "GERMAN" THEN
        REDIM Comb1ArrayGer(1 TO NumComb1Keyword) AS CombKeyType
        REDIM Comb2ArrayGer(1 TO NumComb2Keyword) AS CombKeyType
        REDIM Comb3ArrayGer(1 TO NumComb3Keyword) AS CombKeyType
        EMS2ARRAY Comb1ArrayGer(1), LenComb, NumComb1Keyword, Comb1KeywordEMS%
        EMS2ARRAY Comb2ArrayGer(1), LenComb, NumComb2Keyword, Comb2KeywordEMS%
        EMS2ARRAY Comb3ArrayGer(1), LenComb, NumComb3Keyword, Comb3KeywordEMS%
    ELSE
        REDIM Comb1Array(1 TO NumComb1Keyword) AS NewCombType
        REDIM Comb2Array(1 TO NumComb2Keyword) AS NewCombType
        REDIM Comb3Array(1 TO NumComb3Keyword) AS NewCombType
        EMS2ARRAY Comb1Array(1), LenComb, NumComb1Keyword, Comb1KeywordEMS%
        EMS2ARRAY Comb2Array(1), LenComb, NumComb2Keyword, Comb2KeywordEMS%
        EMS2ARRAY Comb3Array(1), LenComb, NumComb3Keyword, Comb3KeywordEMS%
    END IF

    NumCombFound = 0
    SecondEnd = NumComb2Keyword + NumComb1Keyword
    LenWord = LEN(WordTemp)
    LENKey = LEN(KeyTemp)

    Slash$ = "///"
    ASCslash = ASC("/")

    IF Lang$ = "ENGLISH" THEN English = TRUE ELSE English = FALSE

    d$ = "Marking Combined Keywords: x out of" + STR$(NumWords&)
    x = LEN(d$) - INSTR(d$, "x")
    DispMsg d$, r, c
    c = c - x - 1
    FOR i& = 1 TO NumWords& ' number of words in document
        QPrintRC MID$(STR$(i&), 2), r, c, -1

        ' -- get word from list of parsed document words
        EmsGet WordTemp, LenWord, i&, WordEMS%
        ' -- convert it to a variable-length string for speed
        WordTempStr$ = RTRIM$(WordTemp.Str)

        ' if it's English, then make it lowercase since we ignore case
        ' for Combined Keywords
        IF English THEN
            Lower WordTempStr$
            IF RIGHT$(WordTempStr$, 1) = CHR$(255) THEN
                WordTempStr$ = LEFT$(WordTempStr$, LEN(WordTempStr$) - 1)
            END IF
        END IF

        ' if it's a valid range, then check words in range
        IF FirstLast%(WordTempStr$, First, Last, Comb) THEN

            FOR j = Last TO First STEP -1

                IF Lang$ = "GERMAN" THEN
                    IF j <= NumComb1Keyword THEN
                        KeyTempStr$ = RTRIM$(Comb1ArrayGer(j).Str)
                    ELSEIF j <= SecondEnd THEN
                        KeyTempStr$ = RTRIM$(Comb2ArrayGer(j - NumComb1Keyword).Str)
                    ELSE
                        KeyTempStr$ = RTRIM$(Comb3ArrayGer(j - SecondEnd).Str)
                    END IF
                ELSE
                    IF j <= NumComb1Keyword THEN
                        KeyTempStr$ = RTRIM$(Comb1Array(j).Str)
                    ELSEIF j <= SecondEnd THEN
                        KeyTempStr$ = RTRIM$(Comb2Array(j - NumComb1Keyword).Str)
                    ELSE
                        KeyTempStr$ = RTRIM$(Comb3Array(j - SecondEnd).Str)
                    END IF
                END IF

                words = InCount(KeyTempStr$, " ") + 1 ' count number of words

                CALL Extract(KeyTempStr$, " ", 1, Strt, SLen) ' extract first word
                CurrKey$ = MID$(KeyTempStr$, Strt, SLen) ' of combined keyword

                IF MidChar%(CurrKey$, SLen) = ASCslash THEN
                    Exact = TRUE
                    CurrKey$ = LEFT$(CurrKey$, SLen - 1)
                    SLen = SLen - 1
                ELSE
                    Exact = FALSE
                END IF

                IF SLen < 3 THEN
                    CurrKey$ = CurrKey$ + LEFT$(Slash$, 3 - SLen)
                    SLen = 3
                END IF

                'compare first word of combined key [CurrKey$]
                'against the current document word [WordTempStr$]
                IF English THEN
                    IF NOT Exact THEN
                        Match = (LCASE$(CurrKey$) = LEFT$(WordTempStr$, SLen))
                    ELSE ' check for *exact* match
                        Match = (LCASE$(CurrKey$) = WordTempStr$)
                    END IF
                ELSE ' German
                    IF NOT Exact THEN
                        Match = (CurrKey$ = LEFT$(WordTempStr$, SLen))
                    ELSE
                        Match = (CurrKey$ = WordTempStr$)
                    END IF
                END IF

                ' skip to the next combined key in the First-Last range
                IF NOT Match GOTO SkipCombKey

                ' continue matching the rest of the words in the combined key
                ' exiting out as soon as there's a non-match

                AtFlag = FALSE
                NotFlag = FALSE

                FOR k = 2 TO words ' number of words left in combined key

                    ' extract the next word from the current combined keyword (j)
                    CALL Extract(KeyTempStr$, " ", k, Strt, SLen)
                    CurrKey$ = MID$(KeyTempStr$, Strt, SLen)

                    IF MidChar%(CurrKey$, SLen) = ASCslash THEN
                        Exact = TRUE
                        CurrKey$ = LEFT$(CurrKey$, SLen - 1)
                        SLen = SLen - 1
                    ELSE
                        Exact = FALSE
                    END IF

                    IF SLen < 3 THEN
                        CurrKey$ = CurrKey$ + LEFT$(Slash$, 3 - SLen)
                        SLen = 3
                    END IF

                    IF AtFlag = FALSE AND NotFlag = FALSE THEN
                        EmsGet WordCompare, LenWord, i& + k - 1, WordEMS%
                    ELSE
                        IF AtFlag = FALSE AND NotFlag = TRUE THEN
                            EmsGet WordCompare, LenWord, i& + k, WordEMS%
                        ELSEIF AtFlag = TRUE AND NotFlag = FALSE THEN
                            EmsGet WordCompare, LenWord, i& + k - 2, WordEMS%
                        ELSE
                            EmsGet WordCompare, LenWord, i& + k - 1, WordEMS%
                        END IF
                    END IF

                    DocWord$ = RTRIM$(WordCompare.Str) ' Document word to compare
                    IF English THEN Lower DocWord$

                    IF ASCII%(CurrKey$) <> ASCat THEN
                        IF English THEN
                            IF Exact THEN ' check for *exact* match
                                Match = (LCASE$(CurrKey$) = DocWord$)
                            ELSE ' wildcard match, only compare # of chars in CurrKey$
                                Match = (LCASE$(CurrKey$) = LEFT$(DocWord$, SLen))
                            END IF
                        ELSE ' -- German: no need to use LCASE$
                            IF Exact THEN ' check for *exact* match
                                Match = (CurrKey$ = DocWord$)
                            ELSE ' wildcard match, only compare # of chars in CurrKey$
                                Match = (CurrKey$ = LEFT$(DocWord$, SLen))
                            END IF
                        END IF
                    ELSE ' -- special processing for @ wildcard
                        IF INSTR(AtList$, "/" + DocWord$ + "/") THEN
                            Match = TRUE ' the word was in the @ list, so continue
                        ELSE
                            IF English THEN
                                Match = FALSE
                            ELSE
                                Match = TRUE
                                AtFlag = TRUE
                            END IF
                        END IF

                        IF Match THEN
                            EmsGet WordCompare, LenWord, i& + k, WordEMS%
                            DocWord$ = RTRIM$(WordCompare.Str)
                            IF DocWord$ = "not" OR DocWord$ = "be" OR DocWord$ = "nicht" THEN NotFlag = TRUE
                        END IF
                    END IF

                    IF NOT Match GOTO SkipCombKey
                NEXT ' word in current combined keyword

                IF Match THEN ' this is a combined keyword, so add it to the list
                    NumCombFound = NumCombFound + 1
                    IF Lang$ = "GERMAN" THEN
                        IF j <= NumComb1Keyword THEN
                            CodeTemp.Code = Comb1ArrayGer(j).Code
                        ELSEIF j <= SecondEnd THEN
                            CodeTemp.Code = Comb2ArrayGer(j - NumComb1Keyword).Code
                        ELSE
                            CodeTemp.Code = Comb3ArrayGer(j - SecondEnd).Code
                        END IF
                    ELSE
                        IF j <= NumComb1Keyword THEN
                            CodeTemp.Code = Comb1Array(j).Code
                        ELSEIF j <= SecondEnd THEN
                            CodeTemp.Code = Comb2Array(j - NumComb1Keyword).Code
                        ELSE
                            CodeTemp.Code = Comb3Array(j - SecondEnd).Code
                        END IF
                    END IF

                    EmsSet1El CodeTemp, LENCode, NumCombFound, CombFoundEMS%
                    IF EmsError% THEN ' -- probably ran out of storage in EMS
                        NumCombFound = NumCombFound - 1
                    END IF

                    EXIT FOR
                END IF

SkipCombKey:

            NEXT

        END IF ' Table range was valid

        IF EmsError% THEN EXIT FOR

    NEXT ' key in list

    DispMsg "", r, c

    IF Lang$ = "GERMAN" THEN
        ERASE Comb1ArrayGer
        ERASE Comb2ArrayGer
        ERASE Comb3ArrayGer
    ELSE
        ERASE Comb1Array
        ERASE Comb2Array
        ERASE Comb3Array
    END IF
END SUB
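
The core test in `FindCombKey` is per word: a key word ending in `/` must equal the document word exactly, while any other key word only has to match the document word's leading characters. The Python sketch below shows just that rule applied across a multi-word key. It deliberately leaves out the `@` wildcard list, the "not"/"be"/"nicht" skipping, and the padding of short words with slashes; the function names are hypothetical.

```python
def word_matches(key_word, doc_word):
    """Exact match if the key word ends in '/', otherwise a prefix match."""
    doc_word = doc_word.lower()
    if key_word.endswith("/"):
        return key_word[:-1].lower() == doc_word
    return doc_word.startswith(key_word.lower())

def phrase_matches(combined_key, doc_words, start):
    """Check a space-separated combined key against doc_words beginning at start."""
    key_words = combined_key.split(" ")
    window = doc_words[start:start + len(key_words)]
    if len(window) < len(key_words):
        return False                 # ran past the end of the document
    return all(word_matches(k, w) for k, w in zip(key_words, window))
```

Under these rules the key `patent/ appl` matches "patent application" but not "patents application", because the first word is marked exact.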

SUB FindSingKey (WordEMS%, NumWords&, KeyEMS%, NumKey%, SingFoundEMS%, NumSingFound&) STATIC
'EMS ARRAY    LEN  DESCRIPTION            DIRECTION   MODIFIED?
'WordH:       32   Document words         (Passed)    (Unchanged)
'KeyH:        32   Single Keywords        (Passed)    (Unchanged)
'SingFoundH:  32   Single Keywords Found  (Returned)  (Changed)

    DIM WordTemp AS STR32 ' -- current document word
    DIM KeyTemp AS SingKeyType ' -- Single Keyword to be compared

    LenWord = LEN(WordTemp)
    LENKey = LEN(KeyTemp)
    ASCslash = ASC("/")
    NumSingFound& = 0

    d$ = "Marking Single Keywords: x out of" + STR$(NumWords&)
    x = LEN(d$) - INSTR(d$, "x")
    DispMsg d$, r, c
    c = c - x - 1

    FOR i& = 1 TO NumWords& ' number of words in document

        QPrintRC LTRIM$(STR$(i&)), r, c, -1
        EmsGet WordTemp, LenWord, i&, WordEMS%

        PrefixFlag = FALSE: MeanPrefixFlag = FALSE: UpperCaseFlag = FALSE

        WordTempStr$ = RTRIM$(WordTemp.Str)

        IF RIGHT$(WordTempStr$, 1) = CHR$(255) THEN
            WordTempStr$ = LEFT$(WordTempStr$, LEN(WordTempStr$) - 1)
            NewSentFlag = TRUE
        ELSE
            NewSentFlag = FALSE
        END IF

        ' check if the first 3 letters of the word return
        ' a valid range from the 3-dimensional table array
TryAgain:

        IF FirstLast%(LCASE$(WordTempStr$), First, Last, Sing) THEN ' yes, so search thru range
            FOR j = Last TO First STEP -1

                ' get the word from the SINGKEY.STR list
                EmsGet1El KeyTemp, LENKey, j, KeyEMS%
                CurrKey$ = RTRIM$(KeyTemp.Str)
                SLen = LEN(CurrKey$)

                ' -- compare the single keyword [CurrKey$/KeyTemp.Str]
                '    against the document word [WordTemp.Str]
                '[replaced] IF RIGHT$(CurrKey$, 1) = "/" THEN
                IF MidChar%(CurrKey$, SLen) = ASCslash THEN
                    CurrKey$ = LEFT$(CurrKey$, SLen - 1)
                    Match = (CurrKey$ = RTRIM$(WordTempStr$))
                ELSE
                    Match = (CurrKey$ = LEFT$(WordTempStr$, SLen))
                END IF

                IF Match THEN ' add the single keyword to list
                    NumSingFound& = NumSingFound& + 1
                    CodeTemp.Code = KeyTemp.Code
                    EmsSet CodeTemp, LENCode, NumSingFound&, SingFoundEMS%
                    IF EmsError% THEN ' probably ran out of storage in EMS
                        NumSingFound& = NumSingFound& - 1
                    END IF
                    EXIT FOR
                END IF
            NEXT ' key in range

        ELSE ' check for sections

            Letter$ = LEFT$(WordTemp.Str, 2)
            IF Letter$ = "zs" OR Letter$ = "za" THEN
                SecNum& = VAL(MID$(WordTemp.Str, 3))
                IF SecNum& <= 3000 THEN
                    IF SecNum& THEN
                        IF Letter$ = "zs" THEN
                            CodeTemp.Code = SecNum& + SecCode '10563
                        ELSE
                            CodeTemp.Code = SecNum& + ArtCode '13563
                        END IF
                        IF CodeTemp.Code < ArtCode + 3000 THEN
                            NumSingFound& = NumSingFound& + 1
                            EmsSet CodeTemp, LENCode, NumSingFound&, SingFoundEMS%
                            IF EmsError% THEN ' -- probably ran out of storage in EMS
                                NumSingFound& = NumSingFound& - 1
                            END IF
                            Match = TRUE
                        ELSE ' -- there's an error
                            ErrorFILE% = FREEFILE
                            OPEN "error.txt" FOR APPEND AS ErrorFILE%
                            PRINT #ErrorFILE%, "Document "; File&
                            PRINT #ErrorFILE%, "A Code was out of range:"; CodeTemp.Code
                            PRINT #ErrorFILE%, "WordTemp.Str='"; WordTemp.Str; "'"
                            PRINT #ErrorFILE%, "SecNum="; SecNum&
                            PRINT #ErrorFILE%, ""
                            CLOSE ErrorFILE%
                        END IF
                    END IF ' if there was a number following zs or za
                END IF ' -- if the secnum <= 3000
            END IF ' this was a zs (section) or za (article) word
        END IF ' the range was valid

        'for the first word try upper-case
        IF (NOT Match OR UpperCaseFlag) AND Lang$ <> "GERMAN" THEN
            IF NewSentFlag THEN 'try upper-case

S $ J 8 r V H u ^ E,<UfTJ(UordT ^ St ' $ ' 1,1 * a 

                GOTO TryAgain

            ELSE

                IF UpperCaseFlag THEN 'the word was changed already, check whether we can store it

                    EmsGet WordTemp, LenWord, i& - 1, WordEMS%
                    PrevWord$ = RTRIM$(WordTemp.Str)

                    CALL CheckUCaseWord(WordTempStr$, PrevWord$, LEN(CurrKey$), NoNameFlag)
                    IF NoNameFlag THEN 'can store the word
                        NumSingFound& = NumSingFound& + 1
                        CodeTemp.Code = KeyTemp.Code
                        EmsSet CodeTemp, LENCode, NumSingFound&, SingFoundEMS%
                    END IF

                    UpperCaseFlag = FALSE

                ELSE

                    IF FirstLet >= 65 AND FirstLet <= 90 THEN
                        MID$(WordTempStr$, 1, 1) = LCASE$(LEFT$(WordTempStr$, 1))
                        UpperCaseFlag = TRUE
                        GOTO TryAgain
                    END IF

                END IF

            END IF

        END IF

        IF NOT MeanPrefixFlag THEN

            'check for meaningful prefixes. If found, divide word in two parts
            WordTempStr$ = LCASE$(WordTempStr$)
            LenW = LEN(WordTempStr$)
            FOR NumLet = 14 TO 3 STEP -1
                IF LenW > NumLet + 3 THEN 'should leave at least 3 letters
                    IF INSTR(MeanPrefixes$(NumLet), "\" + LEFT$(WordTempStr$, NumLet) + "\") THEN
                        WordTemp1$ = MID$(WordTempStr$, NumLet + 1)
                        WordTempStr$ = LEFT$(WordTempStr$, NumLet)
                        MeanPrefixFlag = TRUE
                        PrevMatch = Match 'save; Match will change for the prefix
                        EXIT FOR
                    END IF
                END IF
            NEXT

            IF MeanPrefixFlag THEN GOTO TryAgain 'check again

        ELSE

            IF WordTemp1$ <> "" THEN
                IF PrevMatch THEN
                    Limit = 9
                ELSE
                    Limit = 6
                END IF

                WordTempStr$ = WordTemp1$
                WordTemp1$ = ""
                IF LEN(WordTempStr$) >= Limit THEN GOTO TryAgain
            END IF

        END IF

        'check for meaningless prefixes and delete it IF LEN(WordTempStr$) >= 6
        IF NOT PrefixFlag AND NOT Match THEN 'only one time

            WordTempStr$ = LCASE$(WordTempStr$)
            LenW = LEN(WordTempStr$)

            FOR NumLet = 9 TO 2 STEP -1
                IF LenW > NumLet + 3 THEN 'should leave at least 3 letters
                    IF INSTR(Prefixes$(NumLet), "\" + LEFT$(WordTempStr$, NumLet) + "\") THEN
                        WordTempStr$ = MID$(WordTempStr$, NumLet + 1)
                        PrefixFlag = TRUE
                        EXIT FOR
                    END IF
                END IF
            NEXT

            IF PrefixFlag THEN
                Limit = 6
                IF LEN(WordTempStr$) >= Limit THEN GOTO TryAgain
            END IF

        END IF

        IF EmsError% THEN EXIT FOR
    NEXT ' word in document
    DispMsg "", 0, 0
END SUB

FUNCTION FirstLast% (Word$, First%, Last%, KeyType%) STATIC

' returns the starting (First) and ending (Last) range for the word
' by looking it up in the Table%() array

SHARED SingTable%(), CombTable%()

a = XLateTable%(ASCII%(Word$))
b = XLateTable%(MidChar%(Word$, 2))
c = XLateTable%(MidChar%(Word$, 3))

IF a = 0 OR b = 0 OR c = 0 THEN FirstLast% = 0: EXIT FUNCTION

IF KeyType% = Sing THEN

First% = SingTable%(a, b, c, 1)
Last% = SingTable%(a, b, c, 2)

ELSE

First% = CombTable%(a, b, c, 1)
Last% = CombTable%(a, b, c, 2)

END IF

' Return FALSE if there is no valid range (i.e., First% = 0)
FirstLast% = (First% <> 0)

END FUNCTION
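
The three-letter range lookup done by FirstLast% can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `xlate`, `build_table` and `first_last` are hypothetical, a dictionary stands in for the SingTable%/CombTable% arrays, and the translation table is assumed to map only "a" to "z" to non-zero codes.

```python
def xlate(ch):
    """Assumed translation: 'a'..'z' -> 1..26, anything else -> 0."""
    return ord(ch) - ord("a") + 1 if "a" <= ch <= "z" else 0


def build_table(sorted_words):
    """Record first and last 1-based positions for each 3-letter prefix."""
    table = {}
    for pos, word in enumerate(sorted_words, start=1):
        key = tuple(xlate(ch) for ch in word[:3])
        if len(key) < 3 or 0 in key:
            continue  # prefix contains a character that cannot be indexed
        first, _ = table.get(key, (pos, pos))
        table[key] = (first, pos)
    return table


def first_last(table, word):
    """Return (found, first, last); found is False when no range exists."""
    key = tuple(xlate(ch) for ch in word[:3])
    if len(key) < 3 or 0 in key:
        return (False, 0, 0)
    first, last = table.get(key, (0, 0))
    return (first != 0, first, last)


words = ["candidate", "candle", "cane", "pair", "paint", "score"]
table = build_table(words)
```

A caller would then scan only positions `first` through `last` of the sorted keyword list instead of the whole list.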

FUNCTION LoadIntoEMS% (File$) STATIC

' Returns the handle where the file was loaded into

EMSPg = EmsGetPFSeg%
SizeofFile& = FileSize&(File$)

NumPages = SizeofFile& \ SixteenK + 2 ' round off to nearest 2 pages
EmsAlloc NumPages, FileEMS, File$

Num32kBlocks = SizeofFile& \ ThirtyTwoK&

LeftOver& = SizeofFile& - (Num32kBlocks * ThirtyTwoK&)

FOpenAll File$, 0, 4, LoadFILE

FOR i = 1 TO Num32kBlocks + 1

Box0 14, 10, 18, 70, 2, RevAttr
PaintBox0 14, 10, 18, 70, RevAttr

QPrlntRC -Loading - ♦ Files ♦ « bloc k* ♦ STRSCi) ♦ ■ /" ♦ STRS<Nua32kB locks * i; . • 

'-- map pages of the EMS memory to the EMS upper mem page frame

FOR j = 1 TO 2

EmsMapMem FileEMS, j, (i - 1) * 2 + j
IF EmsError% THEN PRINT "Ems error:"; EmsError%: STOP

NEXT

'-- seek to beginning of current block
FSeek LoadFILE, (i - 1) * ThirtyTwoK&

IF DOSError% THEN PRINT "Dos Error:"; WhichError%: STOP
IF i < Num32kBlocks + 1 THEN

'-- get the 32k block and put it directly into the EMS page frame
FGetA LoadFILE, BYVAL EMSPg, BYVAL 0, ThirtyTwoK&

IF DOSError% THEN PRINT "Dos Error:"; ErrorMsg$(WhichError%): STOP

ELSE

'-- load the left over (<32k) bytes

FGetA LoadFILE, BYVAL EMSPg, BYVAL 0, LeftOver&

IF DOSError% THEN PRINT "Dos Error:"; ErrorMsg$(WhichError%): STOP

END IF

NEXT

FClose LoadFILE

ClearScr0 14, 10, 18, 70, NormAttr

LoadIntoEMS% = FileEMS

END FUNCTION
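
The block arithmetic used by LoadIntoEMS (whole 32K blocks followed by one leftover block) can be sketched in Python as below. This is a simplified illustration only: the name `load_in_blocks` is hypothetical, an in-memory byte stream stands in for the DOS file, and a list of chunks stands in for the EMS page frame.

```python
import io

BLOCK = 32768  # 32K, the same value the listing assigns to ThirtyTwoK&


def load_in_blocks(stream, size, block=BLOCK):
    """Read `size` bytes as full blocks followed by one leftover block."""
    full_blocks = size // block
    leftover = size - full_blocks * block
    chunks = []
    for i in range(1, full_blocks + 2):        # blocks 1 .. full_blocks + 1
        stream.seek((i - 1) * block)           # seek to start of current block
        want = block if i < full_blocks + 1 else leftover
        chunks.append(stream.read(want))
    return chunks


data = bytes(range(256)) * 300                 # 76,800 bytes of sample data
chunks = load_in_blocks(io.BytesIO(data), len(data))
```

Here the sample data splits into two full blocks and a final block of 11,264 bytes.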

FUNCTION Num$ (x) STATIC

Num$ = MID$(STR$(x), 2)

END FUNCTION

SUB ReadEnglishText (FirstLine&, LastLine&, Handle%, Lin%) STATIC
DIM Temp AS STR80 ' for use in storing in EMS
LENTemp = LEN(Temp)

DispMsg "Loading file into memory, line #: ", r, c

NumLines = 0 ' total number of lines input from file
ActualLin = 0 ' total number of lines minus any blank lines

■ EndOf Sentences « M7* 

FOR i& = FirstLine& TO LastLine&
GET #9, i&, Temp.Str

' replace all quotes with spaces so as not to complicate the
' lower casing of the first words of sentences
ReplaceChar Temp.Str, CHR$(34), " "
ReadSection Temp.Str, Section$(), Article$()
NumLines = NumLines + 1
EmsSet1El Temp, LENTemp, NumLines, Handle

NEXT

' Process text, first making all first letters of sentences lower case

CurrLine = 0
GOSUB GetNextLine

DO



DO

' skip over blank lines, or if we've gone too far
DO WHILE Start > LenTxt

IF CurrLine = NumLines GOTO EndOfFile
GOSUB GetNextLine

LOOP

p = InstrTbl2%(Start, Txt$, EndOfSentences$)
IF p = 0 THEN

IF CurrLine = NumLines GOTO EndOfFile
GOSUB GetNextLine
p = 0

ELSE '-- check for a NLG (honorific/title)

Start = p + 2

FOR i = 1 TO UBOUND(NLG$)

IF LENNLG(i) < p THEN

IF LCASE$(MID$(Txt$, p - LENNLG(i), LENNLG(i))) = NLG$(i) THEN

p = 0

EXIT FOR

END IF

END IF

NEXT

END IF

LOOP UNTIL p

p = p + 2

IF p > LenTxt THEN

IF CurrLine = NumLines THEN
GOTO EndOfFile

ELSE

GOSUB GetNextLine

END IF

END IF

CurrChar = MidChar%(Txt$, p)

DO WHILE CurrChar < ASCupperA OR CurrChar > ASCupperZ
p = p + 1

IF p > LenTxt THEN

IF CurrLine = NumLines THEN
GOTO EndOfFile

ELSE

GOSUB GetNextLine

END IF

END IF

CurrChar = MidChar%(Txt$, p)

LOOP

MID$(Txt$, p, 1) = CHR$(CurrChar + 32)

' add CHR$(255) to the end of word to indicate beginning of the sentences

SpLoc - InstrTbL2Cp, TxtS, ■ ,»3.!?-> 

IF SpLoc <> 0 THEN MID$(Txt$, SpLoc, 1) = CHR$(255)

Temp.Str = Txt$

EmsSet1El Temp, LENTemp, CurrLine, Handle

Start = p + 1
EndOfFile:

LOOP UNTIL CurrLine = NumLines

New$ = " " ' replacement string for punctuation
Old* 3 -:/.-()0" 

FOR j = 91 TO 96: Old$ = Old$ + CHR$(j): NEXT
LENOld = LEN(Old$)

FOR i = 1 TO NumLines

EmsGet1El Temp, LENTemp, i, Handle
Txt$ = RTRIM$(Temp.Str)

CALL RemCtrl(Txt$, " ") ' replace all Ctrl chars with blanks

' replace only SOME punctuation with spaces
FOR j = 1 TO LENOld

CALL ReplaceChar(Txt$, MID$(Old$, j, 1), New$)

NEXT

CALL StripRange(Txt$, 33, 37, TLen) ' strip punctuation ! to %
Txt$ = LEFT$(Txt$, TLen)

CALL StripRange(Txt$, 58, 64, TLen) ' strip : to @

Txt$ = LEFT$(Txt$, TLen)

CALL StripRange(Txt$, 123, 254, TLen) ' strip high chars
Txt$ = LEFT$(Txt$, TLen)

CALL Crunch(Txt$, " ", TLen) ' crunch all multiple spaces to 1

Txt$ = LEFT$(Txt$, TLen)

Txt$ = LTRIM$(RTRIM$(Txt$)) ' remove spaces from left & right

IF LEN(Txt$) THEN ' there's still text there, i.e. it wasn't all spaces
ActualLin = ActualLin + 1
Temp.Str = Txt$

EmsSet1El Temp, LENTemp, ActualLin, Handle

END IF

CALL QPrintRC(STR$(ActualLin), r, c - 4, -1)

NEXT
CLOSE #1

CALL DispMsg("", r, c)

CALL DispStat(Num$(ActualLin) + " lines in file.")

Lin = ActualLin ' return the actual number of lines saved

EXIT SUB



GetNextLine:

CurrLine = CurrLine + 1

Start = 1 '-- start scanning at first position
p = 1

EmsGet1El Temp, LENTemp, CurrLine, Handle

'-- trim down end of line, but make sure there's one space at the end
' so that we can find end of sentences (looking for a DOT & SPACE)
' even if they're at the end of the line.

Txt$ = RTRIM$(Temp.Str) + " "
LenTxt = LEN(Txt$)

RETURN

END SUB



SUB ReadGermanText (FirstLine&, LastLine&, Handle%, Lin%) STATIC

DIM Temp AS STR80 ' for use in storing in EMS
LENTemp = LEN(Temp)

CALL DispMsg("Loading file into memory, line #: ", r, c)

ActualLin = 0 ' number of lines minus any blank lines

FOR i& = FirstLine& TO LastLine&

GET #9, i&, Temp.Str
Txt$ = QPRTrim$(Temp.Str)

'-- first process the Read Section for German
ReadSection Txt$, Paragraph$(), Artikel$()

Lower Txt$ '-- convert all chars to lower case

RemCtrl Txt$, " " '-- replace all Ctrl chars with blanks

'-- replace only SOME punctuation with spaces
New$ = " "

OldS » -:/.-<)C30' 
FOR j = 1 TO LEN(Old$)

CALL ReplaceChar(Txt$, MID$(Old$, j, 1), New$)

NEXT

CALL StripRange(Txt$, 33, 37, TLen) ' strip punctuation ! to %
Txt$ = LEFT$(Txt$, TLen)

' Note: the range is thru chr$(96) because all the letters are lower case
' and all numbers are being stripped out too. We've skipped over 38
' because it's the & char which is allowed

' NOTE 2 CVJO: we no longer strip out numbers since we now use the section
' and article numbers

CALL StripRange(Txt$, 39, 47, TLen) ' strip ' to /

Txt$ = LEFT$(Txt$, TLen)

CALL StripRange(Txt$, 58, 96, TLen) ' strip : to `

Txt$ = LEFT$(Txt$, TLen)

CALL StripRange(Txt$, 123, 255, TLen) ' strip high chars
Txt$ = LEFT$(Txt$, TLen)

CALL Crunch(Txt$, " ", TLen) '-- crunch all multiple spaces to 1
Txt$ = LEFT$(Txt$, TLen)

Txt$ = QPTrim$(Txt$) ' remove spaces from left & right

IF Txt$ <> "" THEN ' there's still text there, i.e. it wasn't all spaces

ActualLin = ActualLin + 1
Temp.Str = Txt$

EmsSet1El SEG Temp, LENTemp, ActualLin, Handle

END IF

QPrintRC STR$(ActualLin), r, c - 4, -1

NEXT
CLOSE #1

DispMsg "", r, c

DispStat Num$(ActualLin) + " lines in file."

Lin = ActualLin ' return the actual number of lines saved

END SUB

SUB ReadSection (Txt$, SecArray$(), ArtArray$()) STATIC

' Look for "section #" or "article #"

IF Lang$ = "GERMAN" THEN

' In German it's Par
SearchStr$ = "Par"

ELSE

' In English it's Sec
SearchStr$ = "Sec"

END IF

Letter$ = "zs"

FOR LookStep = 1 TO 2

Start = 1

DO

a = INSTR(Start, Txt$, SearchStr$) '-- column of start of Sec. or Art.
IF a THEN

j = INSTR(a, Txt$, " ") '-- position of the end of the word
IF j THEN '-- if this is not a last word
Word$ = MID$(Txt$, a, j - a) ' get the whole word
ELSE
EXIT DO ' this was the last word, so exit

END IF

' check if the word matches variations on Section or Article

IF Lang$ = "GERMAN" AND LookStep = 1 THEN
NumFound = 4

ELSE

NumFound = 3 '-- there are three variations that we check for

END IF

IF LookStep = 1 THEN

CALL FindExact(VARPTR(SecArray$(1)), NumFound, Word$)

ELSE

CALL FindExact(VARPTR(ArtArray$(1)), NumFound, Word$)

END IF

IF NumFound <> -1 THEN ' it did match, so check the number
k = j + 1 ' starting position of [potential] number

DO '-- skip over blank spaces
ch = MidChar%(Txt$, k)
k = k + 1

LOOP UNTIL ch <> 32 OR k > LEN(Txt$)



DO ' collect the whole number

ch = MidChar%(Txt$, k + n1 - 1)
n1 = n1 + 1

IF k + n1 - 1 > LEN(RTRIM$(Txt$)) THEN n1 = n1 + 1: EXIT DO
LOOP UNTIL ch < ASC0 OR ch > ASC9

IF n1 > 1 THEN ' there is a number

Numb$ = MID$(Txt$, k - 1, n1 - 1)

IF VAL(Numb$) <= 3000 AND VAL(Numb$) > 0 THEN

' if we picked up one character more because of end of line, delete it

IF RIGHTSCNuabS, 1) < "0" OR RIGHTS CS;-aeS, 1) > "9" THEN NuobS - LEFTS CNunbS, LENCNusb 

'-- if we're looking for Article numbers, don't accept
' article numbers over 30

IF LookStep = 2 AND VAL(Numb$) > 30 GOTO NextStep

NewWord$ = Letter$ + Numb$

n1 = INSTR(k, Txt$, " ")

Txt1$ = LEFT$(Txt$, a - 1) + NewWord$

IF n1 <> 0 THEN

Txt$ = Txt1$ + MID$(Txt$, n1)

ELSE

Txt$ = Txt1$

END IF

END IF

ELSE

IF SearchStr$ = "Art" THEN
n1 = 0

' loop while it's a Roman numeral and we're not
' past the end of the string
DO

ch$ = MID$(Txt$, k + n1 - 1, 1)
n1 = n1 + 1

IF k + n1 - 1 > LEN(RTRIM$(Txt$)) THEN EXIT DO
LOOP WHILE INSTR("IVX", ch$)

Numb$ = MID$(Txt$, k - 1, n1 - 1)

'-- translate the Roman numeral(s) to Arabic numerals
NumFound = 30 '-- there are 30 Roman numbers to check
CALL FindExact(VARPTR(Numbers$(1)), NumFound, Numb$)

IF NumFound <> -1 THEN

NewWord$ = "za" + Num$(NumFound + 1)
n1 = INSTR(k, Txt$, " ")

IF n1 = 0 THEN

Txt$ = LEFT$(Txt$, a - 1) + NewWord$

ELSE

Txt$ = LEFT$(Txt$, a - 1) + NewWord$ + RIGHT$(Txt$, LEN(Txt$) - n1 + 1)

END IF

END IF

END IF ' are we searching for an Article #?

END IF '-- there's a number after the Section/Article

END IF '-- did we find a variation of Section or Article?

END IF ' INSTR(Text, "Sec.") was found

NextStep:

Start = a + 1

LOOP UNTIL a = 0

SearchStr$ = "Art"
Letter$ = "za"

NEXT



'-- start looking at the beginning of the line
Start = 1

DO

'-- look for the section symbol
a = INSTR(Start, Txt$, CHR$(21))

' if we found one, process it
IF a THEN

'-- position right after the symbol

k = a + 1

n1 = 0

'-- loop until it's not a number (a space is ok, however)
' or we've reached the end of the string

DO

ch = MidChar%(Txt$, k + n1)
n1 = n1 + 1

IF k + n1 - 1 > LEN(RTRIM$(Txt$)) THEN EXIT DO
LOOP UNTIL (ch < ASC0 OR ch > ASC9) AND ch <> 32

' the number is the position from right after the symbol (k)
' to the non-number position found in the loop above (n1 - 1)
Numb$ = QPTrim$(MID$(Txt$, k, n1 - 1))

IF VAL(Numb$) <= 3000 AND VAL(Numb$) > 0 THEN

NewWord$ = "zs" + Numb$

n1 = INSTR(k + 1, Txt$, " ")

IF n1 THEN

Txt$ = LEFT$(Txt$, a - 1) + NewWord$ + MID$(Txt$, n1)

ELSE

Txt$ = LEFT$(Txt$, a - 1) + NewWord$

END IF

END IF

END IF

'-- start looking at the next position
Start = a + 1

' loop until we don't find any more section symbols
LOOP UNTIL a = 0

END SUB
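
The normalization performed by ReadSection (section numbers become a "zs" word, article numbers a "za" word, with Roman numerals converted to Arabic) can be sketched in Python as follows. This is a simplified illustration, not the routine's actual matching logic: the regular expressions and the spellings they accept are assumptions standing in for the SecArray$/ArtArray$ variation lists, and `normalize_refs` is a hypothetical name. The limits of 3000 for sections and 30 for articles are taken from the listing.

```python
import re

# Roman numerals I..XXX mapped to 1..30 (the listing checks 30 Roman numbers).
ONES = ["", "I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX"]
ROMAN = {}
for n in range(1, 31):
    tens, ones = divmod(n, 10)
    ROMAN["X" * tens + ONES[ones]] = n

# Assumed spellings; the real variation lists are not shown in the listing.
SECTION = re.compile(r"\b(?:Section|Sec\.?)\s+(\d+)")
ARTICLE = re.compile(r"\b(?:Article|Art\.?)\s+(\d+|[IVX]+)\b")


def normalize_refs(text):
    """Replace section/article references with zs<n> / za<n> words."""
    def sec(m):
        n = int(m.group(1))
        return "zs" + str(n) if 0 < n <= 3000 else m.group(0)

    def art(m):
        tok = m.group(1)
        n = int(tok) if tok.isdigit() else ROMAN.get(tok, 0)
        return "za" + str(n) if 0 < n <= 30 else m.group(0)

    return ARTICLE.sub(art, SECTION.sub(sec, text))
```

For example, `normalize_refs("see Section 12 and Article IV")` yields a line in which both references are single words that a later keyword lookup can recognize by prefix.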

SUB WindMgr (ULRow, ULCol, LRRow, LRCol, Frame, BoxColr, TextColr, Text$) STATIC

CALL Box0(ULRow - 1, ULCol - 1, LRRow + 1, LRCol + 1, Frame, BoxColr)

CALL ClearScr0(ULRow, ULCol, LRRow, LRCol, BoxColr)

CALL QPrintRC("[" + Text$ + "]", ULRow - 1, ULCol + 1, TextColr)

END SUB

SUB WordParse (TextHandle, Lines, WordHandle, Words&) STATIC

DIM Temp80 AS STR80 ' for retrieving lines of text from the file
DIM Temp32 AS STR32 ' for saving words in the word list

LENTemp80 = LEN(Temp80)
LENTemp32 = LEN(Temp32)

Words& = 0

CALL DispMsg("Parsing line #: word #: ", r, c)

FOR i = 1 TO Lines

QPrintRC STR$(i), r, c - 18, -1

EmsGet1El Temp80, LENTemp80, i, TextHandle

t$ = RTRIM$(Temp80.Str)

TotW = InCount%(t$, " ") + 1 ' number of words in current line
FOR Word = 1 TO TotW

CALL Extract(t$, " ", Word, Start, SLen)
IF SLen > 0 THEN

w$ = MID$(t$, Start, SLen)
' if there is chr$(255), i.e. this is the first word in the sentence, extract it
SecondPart:

IF Lang$ = "ENGLISH" THEN

n = INSTR(w$, CHR$(255))
IF n THEN

w1$ = MID$(w$, n + 1)

w$ = LEFT$(w$, n)
WLen = LEN(w$)

' put "/" before marker (chr$(255))
IF WLen < 4 THEN

w$ = LEFT$(w$, WLen - 1) + STRING$(4 - WLen, "/") + CHR$(255)
ELSEIF MID$(w$, WLen - 2, 2) = "'s" THEN w$ = LEFT$(w$, WLen - 3) + CHR$(255)
ELSEIF MID$(w$, WLen - 1, 1) = "'" THEN w$ = LEFT$(w$, WLen - 2) + CHR$(255)
END IF

END IF

END IF

' fill out 1 and 2 char words with /'s
SLen = LEN(w$)

IF SLen < 3 THEN w$ = w$ + STRING$(3 - SLen, "/")

' allow only words that start with alphabetic chars "a"-"z"

ASCw = ASC(w$)

IF (ASCw >= ASCa AND ASCw <= ASCz) OR (ASCw >= ASCupperA AND ASCw <= ASCupperZ) THEN
Words& = Words& + 1
'QPrintRC STR$(Words&), r, c - 5, -1
IF Lang$ = "ENGLISH" THEN

' the following doesn't apply to German
IF RIGHT$(w$, 2) = "'s" THEN

w$ = LEFT$(w$, SLen - 2)
ELSEIF RIGHT$(w$, 1) = "'" THEN
w$ = LEFT$(w$, SLen - 1)

END IF

END IF

' store the word in EMS now
Temp32.Str = w$

EmsSet Temp32, LENTemp32, Words&, WordHandle

IF EmsError% THEN '-- probably ran out of storage in EMS
Words& = Words& - 1
EXIT FOR

END IF

END IF

IF w1$ <> "" THEN w$ = w1$: w1$ = "": GOTO SecondPart
END IF

NEXT ' word in current line

IF EmsError% THEN EXIT FOR

NEXT ' line of text
CALL DispMsg("", r, c)

CALL DispStat(LTRIM$(STR$(Words&)) + " words were found.")
END SUB
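
The per-word rules applied by WordParse (pad short words with slashes, keep only words that start with a letter, drop a possessive ending) can be sketched in Python as below. This is a reduced illustration: `parse_words` is a hypothetical name, and the CHR$(255) start-of-sentence marker and the EMS storage are left out.

```python
def parse_words(line):
    """Split a cleaned line into index words, following the listing's order."""
    words = []
    for w in line.split(" "):
        if not w:
            continue                         # empty field between two spaces
        if len(w) < 3:
            w = w + "/" * (3 - len(w))       # fill 1- and 2-char words with /'s
        first = w[0]
        if not (first.isascii() and first.isalpha()):
            continue                         # only words starting with a letter
        if w.endswith("'s"):
            w = w[:-2]                       # drop possessive 's
        elif w.endswith("'"):
            w = w[:-1]                       # drop trailing apostrophe
        words.append(w)
    return words
```

Calling `parse_words("the court's ruling in re 12 judges'")` keeps the two-letter words in padded form and skips the number.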

SUB WriteIDFText STATIC

'-- load in Code-->Word dictionary directly into EMS
'LoadFile$ = LstDir$ + "DICT.WRD"
'DictWordNum = FileSize&(LoadFile$) \ SixtyFour

'CALL DispMsg("Loading " + Num$(DictWordNum) + " Dictionary Entries", r, c)

'DIM DictWordTemp AS DictType

'NumPages = DictWordNum \ 256 + 1

'CALL EmsAllocMem(NumPages, DictWordEMS%)

'IF EmsError% THEN PRINT "Couldn't allocate"; NumPages * SixteenK; "bytes of EMS.": STOP
'CALL FOpen(LoadFile$, DictFILE%)

'FOR i = 1 TO DictWordNum

' CALL FGetT(DictFILE%, DictWordTemp, LEN(DictWordTemp))

' CALL EmsSet1El(SEG DictWordTemp, LEN(DictWordTemp), i, DictWordEMS%)
'NEXT

'CALL FClose(DictFILE%)
'CALL DispMsg(Null$, 0, 0)

'CALL DispMsg("Saving IDF.TXT and DOCKEYS.TXT for examination.", r, c)
''save arrays for examination later
'OPEN LstDir$ + "IDF.TXT" FOR OUTPUT AS 1
'FOR i = 1 TO UBOUND(IDF)

' CALL EmsGet1El(SEG DictWordTemp, LEN(DictWordTemp), i, DictWordEMS%)
' PRINT #1, DictWordTemp.Str; IDF(i)

'NEXT
'CLOSE #1

'OPEN LstDir$ + "DOCKEYS.TXT" FOR OUTPUT AS 1

'CALL FOpen(LstDir$ + "DOCKEYS.AH", DocKeysH%)
'FOR i& = 1 TO CurrDocKey&
CALL FGetRT(DocKeysH%, DocKeyTemp, i&, LENDocKeyTemp)

CALL EmsGet1El(SEG DictWordTemp, LEN(DictWordTemp), DocKeyTemp.Code, DictWordEMS%)
PRINT #1, DictWordTemp.Str; DocKeyTemp.Freq
NEXT

CALL FClose(DocKeysH%)
CLOSE #1

ERASE IDF

CALL EmsRelMem(DictWordEMS%)
CALL DispMsg("", 0, 0)
CALL Chime(10)
END SUB

SUB ZZZRevisionLog STATIC
SLog: D;/aio/aio.ba,/ S 

'-- Rev 2.10 06 Feb 1992 13:30:14
' [Changed by TMY:]
'
' * Config Modified: Returns the machine # and First/Last document
' number to process (entered on the command line) so that only one
' AIM.EXE is needed instead of one for each machine.
'
' * [MAIN] Modified: Changed when writing out any machine specific file
' (IDF.SIN, STATS.TXT, etc.) to use the machine number instead of a
' hard-coded number.
'
' * FindSingKey Modified: Changed the check for a final slash (/) from
' using RIGHT$(CurrKey$, 1) to use MIDCHAR%(CurrKey$, SLen) = ASC("/").
'
' * LoadIntoEMS Added: loads a file quickly into EMS.
'
' * EMSAlloc Added: used by LoadIntoEMS -- allocates EMS and displays an
' error if insufficient EMS space.
'
' * Fixed bug where Start variable was COMMON SHARED and was used in
' multiple procedures which affected one another.
'
' * Document files (*DOC.TXT and *DOC.NDX) and configuration (*.CFG,
' [LANGUAGE].LST) are now opened ACCESS READ SHARED.
'
'-- [changed by VN:]
'
' * ReadSection Added: looks for references to Section or Article
' numbers and "normalizes" their appearance by adding a ZS prefix for
' Sections and ZA for articles to the number (article numbers are
' converted from Roman numerals to Arabic) so that they can easily be
' recognized by the FindSingKey routine.
'
' * ReadEnglishText Modified: added call to ReadSection before any
' parsing or stripping of characters is done.
'
' * FindSingKey Modified: added check for section numbers (denoted by a
' ZS prefix to a number) and article numbers (ZA prefix) which returns
' a code of the Section number plus 10,563 (10,563 is the number of
' dictionary entries of words prior to the addition of the section and
' article numbers) or plus 13,563 for Article numbers.
'
' * ReadEnglishText Modified: Changed the StripRange code so that numbers
' would NOT be stripped out. [TMY Bug Fix:] Added in the
' Txt$ = LEFT$(Txt$, TLen) code which was left out (it's required
' since StripRange doesn't change the length of a string) and caused
' problems because the string could contain garbage at the end.



'-- Rev 2.9 24 Jan 1992 17:33:52
'
' final 3-letter table version as of 7/1/91 by Ted M. Young
' uses the KEYWORD.TBL & KEYCOMB.TBL 3-letter indexes
'
'-- Rev 2.8 24 Jan 1991 10:43:38
'
' Changed document loading to use the ISAMed text file (i.e., a single
' large file instead of many individual files).
'
' EMSGet/Set are used instead of the paged EMSGet1El/Set1El access using
' the GetDocKey/SetDocKey SUBs.
'
' COMBKEY.STR & SINGKEY.STR files are loaded directly into EMS since
' they are too large to load into normal memory.
'
'-- Rev 2.7 01 Aug 1990 15:06:16
'
' Renamed CheckCombKey to FindCombKey to be more consistent with naming.
'
' Fixed bug with non-wildcard words (i.e., those that end with a slash
' "/") where it would try and compare the word in the document with the
' word in the keyword list, but without removing the slash from the
' keyword, so the comparison would never be true.
'
' Added feature where the FindCombKey would test for a slash for each
' word in the combined keyword.
'
'-- Rev 2.6 26 Jul 1990 16:29:52
'
' Fixed bug where it wouldn't read the correct information from the
' configuration (.CFG) file. It wasn't reading the NdxDir$, so it would
' stop with a bad file name error.
'
'-- Rev 2.5 25 Jul 1990 18:20:28
'
' Fixed bug: Sometimes the LineCount routine will return too many lines,
' so a check for EOF was put in before reading each line.
'
'-- Rev 2.4 15 Jul 1990 13:14:56
'
' Changed CONFIG so that it can read a configuration file named on
' the command line, but will still default to reading AIM.CFG.
'
'-- Rev 2.3 13 Jul 1990 11:22:56
'
' BC/PDS 7.00 compatible, use of static arrays in typed storage
' for DocKeys

END SUB



' AIMPASS2.BAS
'
' Invocation: AIMPASS2 ConfigFile
'
' Creates: Key.Ndx, Weight.Ndx, \Aim\[Command$].Dat
'
' Uses: Dict.Wrd, Idf.Sin, DocIndex.Ah, DocKeys.Ah
'
' nth Rec of Key.Ndx contains Number of Keywords in Document
' followed by 127 codes
'
' TYPE KeyNdx127
' Num AS INTEGER
' Code(1 TO 127) AS INTEGER
'
' nth Rec of Weight.Ndx contains 127 Salton Weights
' computed with the formula below
'
' TYPE WeightNdx
' Weight(1 TO 127) AS SINGLE

' SALTON WEIGHT FORMULA
'
'                Log2(FreqInDoc+1) * Log2((TotDocs*1.5)/(DocsWithWord+3+TotDocs*.001))
' Weight(Word) = ---------------------------------------------------------------------
'                Log2(2 + TotalKWordsInDoc / 10)
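
The weight formula in the header comment can be restated as a small Python function. This is a sketch for checking values by hand: the function and argument names are hypothetical and simply mirror the terms of the comment (FreqInDoc, TotDocs, DocsWithWord, TotalKWordsInDoc).

```python
import math


def salton_weight(freq_in_doc, tot_docs, docs_with_word, total_kwords_in_doc):
    """Term weight: log-scaled frequency times IDF, normalized by doc length."""
    tf = math.log2(freq_in_doc + 1)
    idf = math.log2((tot_docs * 1.5) / (docs_with_word + 3 + tot_docs * 0.001))
    norm = math.log2(2 + total_kwords_in_doc / 10)
    return tf * idf / norm


# A word seen 3 times in a 60-term document, present in 47 of 1000 documents.
w = salton_weight(3, 1000, 47, 60)
```

The three factors correspond to the Log2(Freq!), IDF!(Code) and TotLog2! values computed separately in the program body.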

DEFINT A-Z

'$INCLUDE: '\\VADIM\C-DRIVE\USER\INCLUDE\TYPES.BI'

TYPE WeightCode
  Code AS INTEGER
  Wt AS SINGLE
END TYPE

TYPE Flen
  Str AS STRING * 12
END TYPE

TYPE DocIndexType
  Ndx AS LONG
  Num AS INTEGER
  Tot AS LONG
  Padding AS STRING * 6
END TYPE

TYPE SmallDocIndexType
  Ndx AS LONG
  Num AS INTEGER
  Tot AS LONG
END TYPE

TYPE STR49
  Str AS STRING * 49
END TYPE

TYPE DocKeyType
  Code AS INTEGER
  Freq AS INTEGER
END TYPE

DECLARE SUB Config (Machine$, First, Last)
DECLARE SUB DispMsg (Msg$, r%, c%)

DECLARE SUB WindMgr (ULRow%, ULCol%, LRRow%, LRCol%, Frame%, BoxColr%, TextColr%, Text$)
DECLARE FUNCTION LoadIntoEMS% (File$)
DECLARE FUNCTION Log2! (x!)
DECLARE FUNCTION Num$ (x)

DECLARE FUNCTION Ems$ (Handle%, Element%, Size%)

' External routines

'$INCLUDE: '\\VADIM\C-DRIVE\USER\INCLUDE\DECLARES.BI'



'-- PROGRAM START

CONST Sing = 0, Comb = 1

COMMON SHARED ReProcess, Fg, Bg, NormAttr, RevAttr

COMMON SHARED DocDir$, KeyDir$, LstDir$, AtList$, Lang$, NdxDir$

COMMON SHARED XLateTable%(), SingTable%(), CombTable%()

COMMON SHARED ThirtyTwo, SixtyFour, SixteenK, ThirtyTwoK&

COMMON SHARED Log2Const!, Threshold!

COMMON SHARED Ascend, Descend, FALSE, TRUE

COMMON SHARED Enter, Escape

'SSSfceyword-flagSSS "lest Author: Xw Revision: Iv Date: If* 

CONST Version$ = "Last Author: TED Revision: 18 Date: 8-Sep-92, 18:31:06"

Log2Const! = LOG(2!)
SixteenK = 16384
ThirtyTwoK& = 32768
SixtyFour = 64
ThirtyTwo = 32
Ascend = 0
Descend = 1
FALSE = 0
TRUE = NOT FALSE
Enter = 13
Escape = 27

ThresholdWord = 80

DIM AvgWeight AS SINGLE ' average of the ThresholdWord keywords' weights



Config Machine$, FirstDoc, LastDoc

QPrintRC " AIMPASS2 Started At " + TIME$, 1, 25, -1
Col = (60 - LEN(Version$)) \ 2
QPrintRC Version$, 2, Col, -1
LOCATE 4, 1

'-- load in Code-->Word dictionary directly into EMS
LoadFile$ = LstDir$ + "DICT.WRD"
DIM DictWrdTemp AS DictType

DictWrdNum = FileSize&(LoadFile$) \ LEN(DictWrdTemp)
DictWrdEMS = LoadIntoEMS(LoadFile$)

'-- load the IDF.SIN file
LoadFile$ = LstDir$ + "IDF.SIN"
IDFNum = FileSize&(LoadFile$) \ 4
PRINT "Loading "; LoadFile$; IDFNum
REDIM IDF!(1 TO IDFNum)
REDIM IDFTemp(1 TO IDFNum) AS LONG

FGetAH LoadFile$, SEG IDFTemp(1), LEN(IDFTemp(1)), IDFNum

'-- Store Average Doc Frequency in [COMMAND$].DAT file in LST directory
FOR i = 1 TO IDFNum

IF IDFTemp(i) > 0 THEN

SumOfAllFreq& = SumOfAllFreq& + IDFTemp(i)

NumWordsUsed = NumWordsUsed + 1

END IF

NEXT

AvgDocFreq! = SumOfAllFreq& / NumWordsUsed

OPEN NdxDir$ + "AVSDOCFC.DAT" FOR OUTPUT AS #7
PRINT #7, SQR(AvgDocFreq!)

CLOSE #7

LoadFile$ = LstDir$ + "DOCINDEX.AH"

DIM DocNdxTemp AS DocIndexType

Count& = FileSize&(LoadFile$) \ LEN(DocNdxTemp)

PRINT "Loading "; LoadFile$; Count&

DIM DocNdx AS DocIndexType

LenDocNdx = LEN(DocNdx)

FOpenAll LoadFile$, 0, 4, LoadHand

LoadFile$ = LstDir$ + "DOCKEYS.AH"
NumDocKeys& = FileSize&(LoadFile$) \ 4
PRINT "Opening "; LoadFile$; NumDocKeys&

CALL FOpenAll(LoadFile$, 0, 4, DocKeysH%) ' open up the file for use later with FGetRTs
DIM DocKeyTemp AS DocKeyType
DocKeyLEN = LEN(DocKeyTemp)



CALL DispMsg("2nd Pass: Calculating Inverse Document Frequencies", r, c)

CountLog! = Log2!(CSNG(Count&))

Log2of5! = Log2!(5!)
A& = FRE("")

WordUse$ = STRING$(DictWrdNum \ 8 + 1, 0) ' setup and clear bit array -- indicates which words

MaxDF! = CLNG(Count&) * 1.5 ' "TotDocs * 1.5"

IDFADD! = 3 + .001 * Count& ' "DocFreq + 3 + (TotDocs * 0.001)"

FOR Word = 1 TO DictWrdNum

IF IDFTemp(Word) > 0 THEN

IDF!(Word) = Log2!(MaxDF! / (IDFTemp(Word) + IDFADD!))

END IF

NEXT

ERASE IDFTemp

CALL DispMsg("", 0, 0)

CALL DispMsg("2nd Pass: Calculating Formula Weights: ", r, c)

OPEN LstDir$ + "Weights.Lst" FOR OUTPUT AS 1

DIM KeyNdxRec AS KeyNdx127

DIM WeightNdxRec AS WeightNdx127

OPEN NdxDir$ + "KEY.NDX" FOR RANDOM AS #3 LEN = LEN(KeyNdxRec)
OPEN NdxDir$ + "WEIGHT.NDX" FOR RANDOM AS #4 LEN = LEN(WeightNdxRec)

TotDocLen& = 0

AvgWeight = 0

Start& = 1

FOR DocNum& = Start& TO Count&

FGetRT LoadHand, DocNdx, DocNum&, LenDocNdx
A$ = STR$(DocNum&)

TotDocLen& = TotDocLen& + DocNdx.Tot

' output every document to the WEIGHTS.LST file

'PRINT #1, A$; " tot Freq: "; DocNdx.Tot

•PRINT n. USING -\ \ \ \ \ \ code.tf - F req»; "DocFr"; -Weight-; *»Ke,vonr 

KeyNdxRec.Num = 0

IF NdxNum = 0 THEN

PRINT #1, "no Keywords for document "; A$

GOTO Pass2Skip

END IF

CALL QPrintRC(STR$(DocNum&), r, c - 6, -1)
REDIM Weight(1 TO NdxNum) AS WeightCode

' total number of term appearances in this document. I.e., if 2 terms
' appear 10 times each, then this number would be 20, as opposed to
' .Num which would be 2 in this case.
Tot! = DocNdx.Tot

IF Tot! < 0 THEN
Chime 2

Tot! = ABS(Tot!)

END IF

IF Tot! = 0 THEN ' shouldn't happen!
CALL Chime(10)

PRINT #1, "There were no Term appearances in document "; A$
PRINT #1, "However, there were"; NdxNum; "keywords."
i$ = INPUT$(1)
GOTO Done

END IF

TotLog2! = Log2!(2! + Tot! / 10!) ' "log((TotWords in Doc/10)+2)"

FOR Word = 1 TO NdxNum

Index& = Word + DocNdx.Ndx - 1

FGetRT DocKeysH%, DocKeyTemp, Index&, DocKeyLEN

Freq! = DocKeyTemp.Freq ' frequency of word in current document

'-- account for integers > 32767 which went negative

IF Freq! < 0 THEN

Freq! = Freq! + 65536

END IF

Freq! = Freq! + 1

Code = DocKeyTemp.Code
Weight(Word).Code = Code

Weight(Word).Wt = Log2!(Freq!) * IDF!(Code) / TotLog2!

NEXT

•PRINT *1, USING "MM #M*fl 0.0AM SUmt Freo' - 1- lDFT«nnf • 
|<£jt| ' ^* '* weight (Word), wt; code; EasJ<CHctVrdQ!SZ, 

' sort in descending order according to weight
' so the keys are output in decreasing importance

SortT Weight(1), NdxNum, Descend, LEN(Weight(1)), LEN(Weight(1).Code), -3

'-- only add if there are ThresholdWord or more words
' otherwise we'd be adding zero
IF NdxNum >= ThresholdWord THEN

AvgWeight = AvgWeight + Weight(ThresholdWord).Wt

END IF

FOR k = 1 TO 127

WeightNdxRec.Weight(k) = 0

NEXT

FOR Word = 1 TO NdxNum

IF Weight(Word).Wt > Threshold! THEN
IF KeyNdxRec.Num < 127 THEN

Code = Weight(Word).Code
CALL SetBit(WordUse$, Code, 1)
KeyNdxRec.Num = KeyNdxRec.Num + 1
KeyNdxRec.Code(KeyNdxRec.Num) = Code
WeightNdxRec.Weight(KeyNdxRec.Num) = Weight(Word).Wt

IF WeightNdxRec.Weight(KeyNdxRec.Num) < Threshold! THEN STOP

ELSE

NumOver127& = NumOver127& + 1
KeyNdxRec.Num = KeyNdxRec.Num + 1
WeightOver127! = WeightOver127! + Weight(Word).Wt

END IF

END IF

NEXT
Pass2Skip:

PUT #3, DocNum&, KeyNdxRec
PUT #4, DocNum&, WeightNdxRec

TotK& = TotK& + KeyNdxRec.Num

IF INKEY$ = CHR$(27) THEN EXIT FOR

NEXT

'-- Calculate Threshold based on avg of the ThresholdWord keyword's Salton Weight
Threshold! = AvgWeight / Count&

'-- now go through the key.ndx and change the key count (KeyNdx.Num) so
' that only the words that have weight >= the threshold will be used

FOR DocNum& = Start& TO Count&

GET #3, DocNum&, KeyNdxRec
GET #4, DocNum&, WeightNdxRec

'-- look through the weights until we find one that's lower than
' the threshold

NewNdxNum = 0

FOR i = 1 TO KeyNdxRec.Num

IF WeightNdxRec.Weight(i) < Threshold! THEN

' this one is lower than the threshold, so the previous
' word should be the last keyword for this document
NewNdxNum = i - 1
EXIT FOR

END IF

NEXT

IF NewNdxNum > 0 THEN ' save new NdxNum keyword count
KeyNdxRec.Num = NewNdxNum
PUT #3, DocNum&, KeyNdxRec

END IF

NEXT

CALL DispMsg("", 0, 0)

CLOSE 3, 4 ' close the random-access key coded file

CALL FClose(DocKeysH%)
CALL EmsRelMem(DictWrdEMS)

PRINT #1, "Total keys written:"; TotK&

PRINT #1, "Average # keys/record:"; TotK& / Count&

PRINT #1, "Average Document Length:"; TotDocLen& / Count&

PRINT #1, ""

PRINT #1, "Threshold:"; Threshold!

PRINT #1, ""

WordsInUse% = 0

FOR i = 1 TO DictWrdNum

WordsInUse% = WordsInUse% - GetBit%(WordUse$, i)

NEXT

PRINT #1, "Words used in collection:"; WordsInUse%; NumWordsUsed
PRINT #1, "Average Document Frequency of words used:"; AvgDocFreq!

PRINT #1, ""

PRINT #1, "Number of documents having more than 127 keywords:"; NumOver127&

CLOSE 1 ' close the weights.lst file
Chime 10
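
The threshold step of the second pass can be sketched in Python as below. This is an illustration of the logic as it reads in the listing, with hypothetical function names and plain lists standing in for the KEY.NDX and WEIGHT.NDX records: the threshold is the sum of each document's ThresholdWord-ranked weight divided by the total number of documents, and a document's keyword list is cut just before the first weight that falls below it.

```python
def compute_threshold(doc_weights, rank=80):
    """Sum the rank-th highest weight of each document, divide by doc count."""
    total = 0.0
    for weights in doc_weights:
        ordered = sorted(weights, reverse=True)
        if len(ordered) >= rank:             # documents with fewer keywords add nothing
            total += ordered[rank - 1]
    return total / len(doc_weights)


def trim_keywords(weights, threshold):
    """Keep keywords up to, not including, the first weight below threshold."""
    ordered = sorted(weights, reverse=True)
    for i, w in enumerate(ordered):
        if w < threshold:
            # As in the listing, a cut position of zero leaves the record as is.
            return ordered[:i] if i > 0 else ordered
    return ordered


docs = [[5.0, 3.0, 1.0], [4.0, 2.0], [9.0]]
threshold = compute_threshold(docs, rank=2)  # small rank for a readable example
```

With the sample data the threshold is 5/3, so the first document loses its lowest-weighted keyword and the other two are unchanged.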

SUB Config (Machine$, First, Last) STATIC

Cmd$ = QPTrim$(COMMAND$)

IF COMMAND$ <> "" THEN

ConfigFile$ = COMMAND$ + ".CFG"

ELSE

Chime 10

PRINT "No database specified."
END

END IF

IF NOT Exist(ConfigFile$) THEN
Chime 10

PRINT "File "; ConfigFile$; " was not found."
END
END IF

OPEN ConfigFile$ FOR INPUT AS #1

INPUT #1, Fg, Bg, Brdr, LstDir$, DocDir$, NdxDir$, AbstrDir$, Lang$, Threshold!
COLOR Fg, Bg, Brdr: CLS

NormAttr = OneColor%(Fg, Bg)
RevAttr = OneColor%(Bg, Fg AND 7)

IF NOT EmsLoaded% THEN
PRINT "No EMS!"
BEEP
END

END IF

END SUB



SUB DispMsg (Msg$, r, c) STATIC

STATIC WindOpen, Scr%() ' is there already a message displayed?
SHARED Fg, Bg

IF Msg$ = "" AND WindOpen THEN GOSUB MsgClose: EXIT SUB
IF WindOpen THEN

CALL Chime(9)

OPEN "DEBUG" FOR OUTPUT AS 9
PRINT #9, "WindOpen="; WindOpen; HEX$(WindOpen)
PRINT #9, "Msg$= ["; Msg$; "]"
PRINT #9, "TRUE="; TRUE; HEX$(TRUE)
PRINT #9, "FALSE="; FALSE; HEX$(FALSE)
CLOSE 9
CALL Chime(8)

CLS

END

i$ = INPUT$(1)

END IF

Wid = LEN(Msg$)

IF Wid > 50 THEN Wid = 50

Msg$ = Msg$ + " " ' make sure there's a space to find at the end (see below)

MaxLin = LEN(Msg$) \ Wid + 3
IF MaxLin > 23 THEN MaxLin = 23
REDIM Text$(MaxLin)
Lin = 0
DO

Lin = Lin + 1 ' incr current line # (also element in text display array)
LastSpc = QInstrB%(Wid + 1, Msg$, " ") ' look for the last space so we can word wrap
Text$(Lin) = LEFT$(Msg$, LastSpc - 1)
Msg$ = MID$(Msg$, LastSpc + 1) ' remove portion of string that's in Text$
LOOP WHILE LEN(Msg$) > Wid

Msg$ = RTRIM$(Msg$)
IF LEN(Msg$) THEN

Lin = Lin + 1

Text$(Lin) = Msg$

END IF

IF r <> 0 AND c = 0 THEN
ULr = r

ELSE

ULr = 9

END IF

DULr = ULr - 1
LRr = ULr + Lin + 1
DLRr = LRr + 2

HorizMargin = (80 - Wid) \ 2
ULc = HorizMargin
DULc = ULc - 3
LRc = 80 - INT(HorizMargin)
IF Uid /2 s Uid \ 2 THEN LRc ■ LRc 1 
DLRc = LRc + 1

REDIM Scr%(ArraySize%(DULr, DULc, DLRr, DLRc))

CALL ScrnSave0(DULr, DULc, DLRr, DLRc, SEG Scr%(0))

CALL WindMgr(ULr, ULc, LRr, LRc, 4, NormAttr, RevAttr, "Status")

FOR i = 1 TO Lin

CALL QPrintRC(Text$(i), ULr + i, ULc + 1, -1)

NEXT

r = ULr + Lin

c = ULc + 1 + LEN(Text$(Lin))

IF LEN(Text$(Lin)) + 2 = Wid THEN c = ULc + 1: r = r + 1

ERASE Text$
WindOpen = TRUE

EXIT SUB

'-- close window

MsgClose:

CALL ScrnRest0(DULr, DULc, DLRr, DLRc, SEG Scr%(0))
ERASE Scr%
WindOpen = FALSE



FUNCTION Ems$ (Handle, Element, Size) STATIC

DIM Temp32 AS STR32
DIM Temp49 AS STR49

IF Size% = 32 THEN

CALL EmsGet1El(SEG Temp32, Size, Element, Handle)

Ems$ = RTRIM$(LEFT$(Temp32.Str, 30))
ELSEIF Size% = 49 THEN

CALL EmsGet1El(SEG Temp49, Size, Element, Handle)

Ems$ = RTRIM$(LEFT$(Temp49.Str, 47))

ELSE

STOP

END IF

END FUNCTION

FUNCTION LoadIntoEMS% (File$) STATIC

'-- Returns the handle where the file was loaded into

EMSPg = EmsGetPFSeg%

SizeofFile& = FileSize&(File$)

NumPages = SizeofFile& \ SixteenK + 2 ' round off to nearest 2 pages
EmsAllocMem NumPages, FileEMS

Num32kBlocks = SizeofFile& \ ThirtyTwoK&

LeftOver& = SizeofFile& - (Num32kBlocks * ThirtyTwoK&)

FOpenAll File$, 0, 4, LoadFile

FOR i = 1 TO Num32kBlocks + 1

Box0 14, 10, 18, 70, 2, RevAttr
PaintBox0 14, 10, 18, 70, RevAttr

GPMntRC "Loading ■ + Files ♦ * block" * STOCO ♦ " /* ♦ STRSCNu»J2kB locks ♦ 1) ♦ » 
' — nap pages of the ENS seaory to the ENS upper eea page fraee 

f OR j a 1 TO 2 

Easftapnen FiteEftS, j, (i - 1) * 2 + j 

IF EasErrcrX THEN PRINT "Eos error:"; EasErrorX: STOP 

NEXT 

seek to beginning of current block 
FSeek LoadFile, <i - 1) * ThirtyTwc*3 

IF OOSErrorX THEN PRINT "Oos Error: 0 ; WnchErrorX: STOP 

IF i < Nua32kBlocko ♦ 1 THEN 

'— S«t the 32k block and put it directly into the ENS page frame 
FSetA LoadFile, BYVAL EHSPg, BYVAL 0, TbirtyTwott 

IF OOSErrorX THEM PRINT "Dos Error:"; Errorfteg$(«hichError2) : STOP 

ELSE 

load the left over C<32k) bytes 
FGetA LoadFile, BYVAL ENSPg, BYVAL 0, LeftOveri 

IF DOS-rrorX THEN PRINT "Dos Error:"; ErrortsgSCUhichErrorX): STOP 

END IF 

NEXT 

F Close LoadFile 

ClearScrO 14, 10, 18, 70, NoroAttr 
LoedlntoEHS = FileENS 
END FUNCTION 
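The block arithmetic used by LoadIntoEMS can be checked in isolation. The following is only a sketch in Python, not part of the original program: `plan_blocks` and its return shape are hypothetical names chosen for illustration, and it models just the page count, the 32K block offsets, the left-over byte count and the two logical pages mapped per block.

```python
PAGE_SIZE = 16 * 1024    # one EMS page (SixteenK in the listing)
BLOCK_SIZE = 32 * 1024   # two pages are filled per read (ThirtyTwoK&)


def plan_blocks(file_size):
    """Return (num_pages, blocks) for a file of file_size bytes.

    Each entry of blocks is (file_offset, byte_count, (logical_page_1, logical_page_2)),
    mirroring the FOR i / FOR j loops of the listing.
    """
    num_pages = file_size // PAGE_SIZE + 2          # same rounding as the listing
    full_blocks = file_size // BLOCK_SIZE
    left_over = file_size - full_blocks * BLOCK_SIZE
    blocks = []
    for i in range(1, full_blocks + 2):             # i = 1 .. full_blocks + 1
        count = BLOCK_SIZE if i <= full_blocks else left_over
        pages = ((i - 1) * 2 + 1, (i - 1) * 2 + 2)  # logical pages mapped for this block
        blocks.append(((i - 1) * BLOCK_SIZE, count, pages))
    return num_pages, blocks
```

For a 70,000-byte file this plans two full 32K reads and one final read of 4,464 bytes, using six 16K pages in total.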

FUNCTION Log2! (x!) STATIC

SHARED Log2Const!

Log2! = LOG(x!) / Log2Const!

END FUNCTION

FUNCTION Num$ (x) STATIC

Num$ = LTRIM$(STR$(x))

END FUNCTION

SUB WindMgr (ULRow, ULCol, LRRow, LRCol, Frame, BoxColr, TextColr, Text$) STATIC

CALL Box0(ULRow - 1, ULCol - 1, LRRow + 1, LRCol + 1, Frame, BoxColr)

CALL ClearScr0(ULRow, ULCol, LRRow, LRCol, BoxColr)

CALL QPrintRC("[" + Text$ + "]", ULRow - 1, ULCol + 1, TextColr)

END SUB
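Log2! is the usual change-of-base identity. A minimal Python sketch follows, assuming that the shared constant Log2Const! holds the natural logarithm of 2; the listing does not show where that constant is set, so `LOG2_CONST` here is an assumption.

```python
import math

LOG2_CONST = math.log(2.0)   # assumed value of the shared Log2Const!


def log2(x):
    """Base-2 logarithm computed from the natural logarithm."""
    return math.log(x) / LOG2_CONST
```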

DEFINT A-Z

' KYINVERT.BAS

'Invoked: KYINVERT ConfigFile

'Creates: KyInvert.Ndx, KyInvert.Dat

'Uses: Key.Ndx, Weight.Ndx & Dict.Wrd for NumKeys

'nth Rec of KyInvert.Ndx contains nth Code,
'ptr into KyInvert.Dat & Number of Docs indexed with this word
'  TYPE NdxType
'     Code AS INTEGER
'     Index AS LONG
'     Num AS INTEGER

'Records are pointed to by .Index of KyInvert.Dat
'  .Rec contains Document that is indexed with this word
'  .Value 1000 * Salton Weight in Document
'  TYPE KeyInfoLONG
'     Rec AS LONG
'     Value AS INTEGER

'$INCLUDE: '\\vadim\c-drive\user\include\types.bi'

'----- now in types.bi above
'TYPE KeyInfoLONG
'   Rec AS LONG         '----- record (document) number
'   Value AS INTEGER    '----- holds the weight of each keyword in the record
'                       '      scaled to fit an integer
'END TYPE

TYPE LinkHead
   Num AS LONG          '----- number of nodes in the linked list
   FirstPtr AS LONG     '----- pointer to first node
   LastPtr AS LONG      '----- pointer to the last node
   Pad AS STRING * 4    '----- padding for huge arrays
END TYPE

'----- holds the actual data, in this case the Keyword's record # and its value
TYPE LinkNode
   Info AS KeyInfoLONG
   Ptr AS LONG
END TYPE

'----- Information for VMS routines
TYPE VMSTableType
   Handle AS INTEGER
   TempFile AS STRING * 62
END TYPE

TYPE FileInfoType
   Year AS INTEGER
   Month AS INTEGER
   Day AS INTEGER
   Hour AS INTEGER
   Minute AS INTEGER
   Second AS INTEGER
   Size AS LONG
   Attrib AS INTEGER
END TYPE

CONST NULL = 0

'----- Color Attributes
COMMON SHARED Fg, Bg, Brdr, NormAttr, RevAttr, ShiftValue%, MaxWgt

'----- Directories
COMMON SHARED LstDir$, DocDir$, KeyDir$, NdxDir$, ConfigName$

'----- Temp variables
COMMON SHARED NodeLEN%, NodeTemp AS LinkNode

'----- global EMS usage flag
COMMON SHARED gfEMS AS INTEGER

'----- number of allocations for VMS routines
COMMON SHARED gVMSNumAllocations AS INTEGER
COMMON SHARED VMSError%

'----- allocation information (handle and filename)
COMMON SHARED gVMSTable() AS VMSTableType

'$INCLUDE: '\\vadim\c-drive\user\include\const.bi'

'----- Internal SUBs

DECLARE SUB AddNode (Head AS LinkHead, Info AS KeyInfoLONG, hStorage AS INTEGER, FreePtr AS LONG)
DECLARE SUB Config ()

'----- Internal functions

DECLARE FUNCTION EmsAlloc% (NumBytes&, Handle%)
DECLARE FUNCTION FileDate$ (FInfo AS ANY)
DECLARE FUNCTION Unique$ (Path$)
DECLARE FUNCTION VMSAlloc% (NumBytes&, Handle%)
DECLARE FUNCTION VMSRelease% (Handle%)

'----- External SUBs

DECLARE SUB Chime (Number%)
DECLARE SUB DeleteT (SEG StartElement AS ANY, ElSize%, NumEls%)
DECLARE SUB EmsAllocMem (NumPages%, Handle%)
DECLARE SUB EmsGet (SEG Value AS ANY, ElSize%, ElNum&, Handle%)
DECLARE SUB EmsSet (SEG Value AS ANY, ElSize%, ElNum&, Handle%)
DECLARE SUB EmsRelMem (Handle%)
DECLARE SUB FClose (Handle%)
DECLARE SUB FCreate (FileName$)
DECLARE SUB FGetRT (Handle%, Destination AS ANY, RecNumber&, RecSize%)
DECLARE SUB FileInfo (FileName$, SEG Address AS ANY)
DECLARE SUB FOpen (FileName$, Handle%)
DECLARE SUB FOpenAll (FileName$, AccessMode%, ShareMode%, Handle%)
DECLARE SUB FPut (Handle%, Work$)
DECLARE SUB FPutAH (FileName$, SEG Element AS ANY, ElSize%, NumEls%)
DECLARE SUB FPutRT (Handle%, Destination AS ANY, RecNumber&, RecSize%)
DECLARE SUB FPutT (Handle%, Source AS ANY, RecSize%)
DECLARE SUB FSeek (Handle%, Location&)
DECLARE SUB QPrintRC (St$, Row%, Col%, Colr%)

'----- External FUNCTIONS

DECLARE FUNCTION DOSError%
DECLARE FUNCTION WhichError%
DECLARE FUNCTION ErrorMsg$ (ErrNum%)
DECLARE FUNCTION EmsError%
DECLARE FUNCTION EmsLoaded%
DECLARE FUNCTION EmsPagesFree%
DECLARE FUNCTION Exist% (FileName$)
DECLARE FUNCTION FileSize& (FileName$)
DECLARE FUNCTION OneColor% (Fore%, Back%)
DECLARE FUNCTION Peek2% (Segment, Address)



'----- PROGRAM START

'$$$keyword-flag$$$ "Last Author: %w Revision: %v Date: %f"

CONST Version$ = "Last Author: TED Revision: 16 Date: 9-Sep-92,18:51:04"

CALL Config

LOCATE 18, 1

QPrintRC "Key Invert (KYINVERT) Started on " + DATE$ + " at " + TIME$, 1, 15, -1
Col = (80 - LEN(Version$)) \ 2
QPrintRC Version$, 2, Col, -1

'----- use dictionary to determine total number of keywords
LoadFile$ = LstDir$ + "DICT.WRD"
DIM DictEntry AS DictType
NumKeys = FileSize&(LoadFile$) \ LEN(DictEntry)

NodeLEN = LEN(NodeTemp)

KyLEN = LEN(Ky)

DIM Wgt AS WeightNdx127
WgtLEN = LEN(Wgt)

DIM KeyNdxFInfo AS FileInfoType
DIM CountFInfo AS FileInfoType

'----- use KEY.NDX to determine total number of documents
LoadFile$ = NdxDir$ + "KEY.NDX"

FileInfo LoadFile$, KeyNdxFInfo

NumDocs& = FileSize&(LoadFile$) \ KyLEN

FOpenAll LoadFile$, ACCESSREAD, SHAREDENYWRITE, KeyNdxH%
IF DOSError% THEN
   LOCATE 18, 1
   PRINT "Opening "; LoadFile$
   PRINT "A DOS Error Occurred:"; WhichError%; ErrorMsg$(WhichError%)
END IF

LoadFile$ = NdxDir$ + "WEIGHT.NDX"

FOpenAll LoadFile$, ACCESSREAD, SHAREDENYWRITE, WeightNdxH%
IF DOSError% THEN
   LOCATE 18, 1
   PRINT "Opening "; LoadFile$
   PRINT "A DOS Error Occurred:"; WhichError%; ErrorMsg$(WhichError%)
END IF

'----- array for the KYINVERT.NDX information
REDIM Ndx(1 TO NumKeys) AS NdxType

DIM Info AS KeyInfoLONG
InfoLEN = LEN(Info)

'----- count how many keywords there actually are
QPrintRC "Counting Number of Keywords in use...", 4, 1, -1
TotKeys& = 0

LoadFile$ = LstDir$ + ConfigName$ + ".CNT"

IF Exist%(LoadFile$) THEN

   FileInfo LoadFile$, CountFInfo

   IF FileDate$(CountFInfo) < FileDate$(KeyNdxFInfo) THEN
      '----- delete the count file because it's outdated
      KILL LoadFile$
   END IF

END IF

IF NOT Exist%(LoadFile$) THEN

   FOR DocNum& = 1 TO NumDocs&
      QPrintRC STR$(DocNum&), 4, 38, -1
      FGetRT KeyNdxH%, Ky, DocNum&, KyLEN
      TotKeys& = TotKeys& + Ky.Num
      QPrintRC STR$(TotKeys&), 4, 45, -1
   NEXT

   hFile = FREEFILE
   OPEN LoadFile$ FOR OUTPUT AS #hFile
   PRINT #hFile, TotKeys&
   CLOSE #hFile

ELSE

   hFile = FREEFILE
   OPEN LoadFile$ FOR INPUT AS #hFile
   INPUT #hFile, TotKeys&
   CLOSE #hFile

END IF

'----- divide into sections and allocate maximum EMS
NumBytes& = 16384& * (EmsPagesFree% - 1)

MaxNodes& = NumBytes& \ NodeLEN

NumSections = 4

SectionSize! = NumKeys / NumSections

'----- try to allocate space in EMS
NumPages = EmsAlloc%(NumBytes&, InvertDatMEM)

IF NumPages > 0 THEN

   '----- enough EMS was available
   QPrintRC "Number of EMS Pages Allocated =" + STR$(NumPages) + ". MaxNodes =" + STR$(MaxNodes&), 6, 1, -1
   gfEMS = TRUE

ELSE

   '----- not enough EMS was available, use VMS routines
   QPrintRC "Insufficient EMS Available. Requested" + STR$(ABS(NumPages)) + " EMS pages.", 6, 1, -1
   QPrintRC STR$(EmsPagesFree%) + " EMS pages were available.", 7, 1, -1
   gfEMS = FALSE

   '----- try to allocate space using VMS
   IF VMSAlloc%(NumBytes&, InvertDatMEM) THEN
      QPrintRC "Number of bytes of VMS Allocated =" + STR$(NumBytes&), 8, 1, -1
   ELSE
      QPrintRC "Unable to allocate VMS.", 8, 1, -1
      STOP
   END IF

END IF

SaveFile$ = NdxDir$ + "KYINVERT.DAT"

FCreate SaveFile$

FOpen SaveFile$, InvertDatFile

Index& = 1

FOR Section = 1 TO NumSections

   KeyLBound = (Section - 1) * SectionSize! + 1
   KeyUBound = Section * SectionSize!

   '----- make sure we process to the last key in case of round-off errors
   IF Section = NumSections AND KeyUBound <> NumKeys THEN KeyUBound = NumKeys

   REDIM LinkList(KeyLBound TO KeyUBound) AS LinkHead

   QPrintRC "Section" + STR$(Section) + ": Processing Keys" + STR$(KeyLBound) + " ->" + STR$(KeyUBound), 10, 1, -1

   '----- initialize element number of the next free pointer
   FreePtr& = 1

   FOR DocNum& = 1 TO NumDocs&

      QPrintRC "Doc #" + STR$(DocNum&) + STR$(FreePtr&) + " ", 12, 1, -1

      FGetRT KeyNdxH%, Ky, DocNum&, KyLEN
      FGetRT WeightNdxH%, Wgt, DocNum&, WgtLEN

      Info.Rec = DocNum&

      '----- process the list of keywords for this document
      FOR j = 1 TO Ky.Num

         CurrCode = Ky.Code(j)

         '----- only add this code to the linked list if it's
         '      in the section we're currently processing
         IF CurrCode >= KeyLBound AND CurrCode <= KeyUBound THEN

            '----- if the weight is larger than the maximum (scaled) weight
            '      then just assign it the maximum weight
            IF Wgt.Weight(j) > MaxWgt THEN
               Wgt.Weight(j) = MaxWgt
            END IF

            '----- scale the single precision weight to an integer
            Info.Value = Wgt.Weight(j) * ShiftValue%

            '----- AddNode stores the current Info at FreePtr& and increments it
            AddNode LinkList(CurrCode), Info, InvertDatMEM, FreePtr&

            '----- if FreePtr& was just incremented above the allocation then stop
            IF FreePtr& > MaxNodes& THEN
               Chime 8
               LOCATE 18, 1
               PRINT "Exceeded allocation of"; MaxNodes&; "nodes."
               PRINT "DocNum& ="; DocNum&; " j (keyword number) ="; j
               STOP
            END IF

         END IF

      NEXT

   NEXT



   '----- start writing inverted file by traversing the linked list

   '----- write out keys for the current section
   FOR i = KeyLBound TO KeyUBound

      '----- point to head of linked list for this keyword
      Ptr& = LinkList(i).FirstPtr

      QPrintRC STR$(i) + " out of " + STR$(KeyUBound) + ":" + STR$(LinkList(i).Num) + "

      Ndx(i).Code = i
      Ndx(i).Index = Index&
      Ndx(i).Num = LinkList(i).Num

      '----- traverse linked-list for this keyword
      DO WHILE Ptr&

         IF gfEMS THEN
            EmsGet NodeTemp, NodeLEN, Ptr&, InvertDatMEM
         ELSE
            FGetRT InvertDatMEM, NodeTemp, Ptr&, NodeLEN
         END IF

         '----- write (append) the Info to disk
         FPutT InvertDatFile, NodeTemp.Info, InfoLEN

         '----- get next pointer
         Ptr& = NodeTemp.Ptr

      LOOP

      Index& = Index& + Ndx(i).Num

   NEXT

NEXT Section

'----- release memory for inverted list data
IF gfEMS THEN

   EmsRelMem InvertDatMEM

ELSE

   IF NOT VMSRelease%(InvertDatMEM) THEN
      Chime 8
      LOCATE 22, 1
      PRINT "Problem in VMSRelease."
      STOP
   END IF

END IF

'----- close input files KEY.NDX and WEIGHT.NDX
FClose KeyNdxH%
FClose WeightNdxH%

'----- close KYINVERT.DAT output file
FClose InvertDatFile

'----- write the KYINVERT.NDX to disk from the Ndx() array
FPutAH NdxDir$ + "KYINVERT.NDX", Ndx(1), LEN(Ndx(1)), NumKeys

DATA "Copyright 1990-2 by Ted H. Young. ALL RIGHTS RESERVED."
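The overall shape of the inversion pass above (several passes over the documents, one key range per pass, clamped and scaled weights, and an index of start position plus count per keyword) can be sketched compactly. This Python sketch is illustrative only: `invert_in_sections` and its in-memory inputs are hypothetical, it uses plain lists instead of EMS or disk storage, and it computes section bounds with integer arithmetic rather than the listing's single-precision SectionSize!.

```python
def invert_in_sections(docs, num_keys, num_sections=4, shift=1000, max_int=32767):
    """Build an inverted file from per-document keyword lists.

    docs: list of lists of (code, weight); document numbers are 1-based.
    Returns (ndx, dat): ndx maps code -> (start_index, count) with 1-based
    start positions into dat, and dat is a list of (doc_number, scaled_weight).
    """
    max_wgt = max_int // shift                  # largest weight that still fits after scaling
    ndx, dat = {}, []
    index = 1
    for section in range(1, num_sections + 1):
        lo = (section - 1) * num_keys // num_sections + 1
        hi = section * num_keys // num_sections
        postings = {code: [] for code in range(lo, hi + 1)}
        for doc_num, keywords in enumerate(docs, start=1):
            for code, weight in keywords:
                if lo <= code <= hi:            # only keys of the current section
                    value = round(min(weight, max_wgt) * shift)
                    postings[code].append((doc_num, value))
        for code in range(lo, hi + 1):          # write out this section's keys in order
            ndx[code] = (index, len(postings[code]))
            dat.extend(postings[code])
            index += len(postings[code])
    return ndx, dat
```

Processing one key range at a time bounds the memory needed for the per-keyword lists, at the cost of rereading every document once per section.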

SUB AddNode (Head AS LinkHead, Info AS KeyInfoLONG, hStorage AS INTEGER, FreePtr AS LONG) STATIC

IF Head.FirstPtr = NULL THEN

   Head.FirstPtr = FreePtr
   Head.LastPtr = FreePtr

ELSE

   IF gfEMS THEN
      EmsGet NodeTemp, NodeLEN, Head.LastPtr, hStorage
      IF EmsError% THEN
         Chime 8
         PRINT "EMS Get Error in AddNode at Head.LastPtr="; Head.LastPtr
         STOP
      END IF
   ELSE
      FGetRT hStorage, NodeTemp, Head.LastPtr, NodeLEN
   END IF

   NodeTemp.Ptr = FreePtr        '----- point the node's next ptr to new node

   IF gfEMS THEN
      EmsSet NodeTemp, NodeLEN, Head.LastPtr, hStorage
      IF EmsError% THEN
         Chime 8
         PRINT "EMS Set Error in AddNode at Head.LastPtr="; Head.LastPtr
         STOP
      END IF
   ELSE
      FPutRT hStorage, NodeTemp, Head.LastPtr, NodeLEN
   END IF

   Head.LastPtr = FreePtr        '----- set the last pointer to the new node

END IF

'----- increment count of nodes for this linked list
Head.Num = Head.Num + 1

'----- store the info in the node and set its pointer to NULL
NodeTemp.Info = Info
NodeTemp.Ptr = NULL

IF gfEMS THEN
   EmsSet NodeTemp, NodeLEN, FreePtr, hStorage
   IF EmsError% THEN
      Chime 8
      PRINT "EMS Error in AddNode at FreePtr="; FreePtr
      STOP
   END IF
ELSE
   FPutRT hStorage, NodeTemp, FreePtr, NodeLEN
END IF

'----- increment pointer to next free node
FreePtr = FreePtr + 1

END SUB
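AddNode appends to one of many singly linked lists that all share a single pool of numbered node slots, with 0 (NULL) marking the end of a list. A minimal Python sketch of that data structure follows; the names `LinkHead`, `add_node` and `traverse` are illustrative, and a dictionary stands in for the EMS or disk-backed node storage.

```python
NULL = 0


class LinkHead:
    """Per-keyword list header: node count plus first and last slot numbers."""

    def __init__(self):
        self.num = 0
        self.first_ptr = NULL
        self.last_ptr = NULL


def add_node(head, info, pool, free_ptr):
    """Append info to the list described by head; return the next free slot."""
    if head.first_ptr == NULL:
        head.first_ptr = free_ptr
    else:
        prev_info, _ = pool[head.last_ptr]
        pool[head.last_ptr] = (prev_info, free_ptr)   # link the old tail to the new node
    head.last_ptr = free_ptr
    head.num += 1
    pool[free_ptr] = (info, NULL)                     # new node ends the list
    return free_ptr + 1


def traverse(head, pool):
    """Yield the info of each node in insertion order."""
    ptr = head.first_ptr
    while ptr:
        info, ptr = pool[ptr]
        yield info
```

Because every list draws slots from the same counter, nodes of different keywords interleave in storage while each list still reads back in document order.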

SUB Config STATIC

IF COMMAND$ <> "" THEN

   ConfigFile$ = COMMAND$ + ".CFG"
   ConfigName$ = COMMAND$

ELSE

   PRINT "No configuration file was given."
   CALL Chime(6)
   DO: LOOP UNTIL LEN(INKEY$)
   END

END IF

IF Exist%(ConfigFile$) THEN

   OPEN ConfigFile$ FOR INPUT AS #1
   INPUT #1, Fg, Bg, Brdr, LstDir$, DocDir$, NdxDir$, AbstrDir$, Lang$, x$
   CLOSE #1

ELSE

   CALL Chime(6)
   PRINT "Configuration file "; ConfigFile$; " does not exist."
   DO: LOOP UNTIL LEN(INKEY$)
   END

END IF

COLOR Fg, Bg, Brdr

NormAttr = OneColor%(Fg, Bg)
RevAttr = OneColor%(Bg, Fg AND 7)

ShiftValue% = 1000
MaxWgt = 32767 \ ShiftValue%

CLS

IF NOT EmsLoaded% THEN

   PRINT "NO EMS!"
   CALL Chime(8)
   DO: LOOP UNTIL LEN(INKEY$)
   END

END IF
END SUB

FUNCTION EmsAlloc% (NumBytes&, Handle%) STATIC

'----- calculate number of 16K EMS pages to allocate
NumPgs% = NumBytes& \ 16384 + 1

'----- return number of pages allocated, or the negative of the
'      number requested if insufficient EMS is available
IF EmsPagesFree% < NumPgs% THEN
   EmsAlloc% = -NumPgs%
ELSE
   EmsAllocMem NumPgs%, Handle%
   EmsAlloc% = NumPgs%
END IF

END FUNCTION

   TempFile$ = RTRIM$(gVMSTable(Found).TempFile)
   KILL TempFile$

   DeleteT gVMSTable(Found), LEN(gVMSTable(Found)), gVMSNumAllocations - Found
   gVMSNumAllocations = gVMSNumAllocations - 1
   REDIM gVMSTable(1 TO gVMSNumAllocations) AS VMSTableType

   VMSRelease = TRUE

ELSE     '----- something's wrong!

   VMSError% = 131      '----- invalid handle
   VMSRelease = FALSE

END IF

END FUNCTION
DEFINT A-Z

'FreqComp version 4.5 3/26/91 15:30

'Implements the Inverted Index access method along with the weighted values
'to calculate the frequent companions for each of the words used in the
'document collection.

'Invoked: freqcomp ConfigFile
'Creates: FreqComp.127
'Uses: KyInvert.Ndx, KyInvert.Dat, Key.Ndx, Weight.Ndx

'nth Rec of FreqComp.127 contains nth Code, Num of companions of this code &
'127 pairs of (code,100*weight)

'  TYPE FreqComp127
'     Num AS INTEGER
'     Code AS INTEGER
'     Comp(1 TO 127) AS INTEGER
'     Value(1 TO 127) AS STRING * 1



CONST ACCESSREAD = 0
CONST ACCESSWRITE = 1
CONST ACCESSREADWRITE = 2

CONST SHARECOMPAT = 0
CONST SHAREDENYREADWRITE = 1
CONST SHAREDENYWRITE = 2
CONST SHAREDENYREAD = 3
CONST SHAREDENYNONE = 4

COMMON SHARED TRUE, FALSE, NormAttr, RevAttr
COMMON SHARED NdxDir$, DocDir$, LstDir$, ASCEND, DESCEND
COMMON SHARED KeyEMS, KYInvertNdxEMS, KYInvertDatEMS, WeightEMS
COMMON SHARED SixteenK, ThirtyTwoK&

'$INCLUDE: '\\vadim\c-drive\USER\include\types.bi'

DECLARE FUNCTION ArraySize% (ULrow, ULcol, LRrow, LRcol)
DECLARE FUNCTION DosError%
DECLARE FUNCTION EmsError%
DECLARE FUNCTION EmsGetPFSeg%
DECLARE FUNCTION EmsPagesFree%
DECLARE FUNCTION EmsLoaded%
DECLARE FUNCTION EmsNumPages%
DECLARE FUNCTION Exist% (FileName$)
DECLARE FUNCTION FileSize& (FileName$)
DECLARE FUNCTION OneColor% (Fore%, Back%)
DECLARE FUNCTION QInstrB% (SPos%, Source$, Srch$)
DECLARE FUNCTION WhichError%

DECLARE SUB Array2EMS (SEG Element AS ANY, ElSize%, NumEls%, Handle%)
DECLARE SUB BCopyT (SEG FromEl AS ANY, SEG ToEl AS ANY, ElSize%, NumEls%)
DECLARE SUB Chime (Number%)
DECLARE SUB EmsAllocMem (NumPages%, Handle%)
DECLARE SUB EmsGet (SEG Value AS ANY, ElSize%, ElNum&, Handle%)
DECLARE SUB EmsGet1El (SEG Value AS ANY, ElSize%, ElNum%, Handle%)
DECLARE SUB EmsMapMem (Handle%, PhysPage%, LogPage%)
DECLARE SUB EmsRelMem (Handle%)
DECLARE SUB EmsSet (SEG Value AS ANY, ElSize%, ElNum&, Handle%)
DECLARE SUB EmsSet1El (SEG Value AS ANY, ElSize%, ElNum%, Handle%)
DECLARE SUB FClose (Handle%)
DECLARE SUB FGetA (Handle%, BYVAL SegAdr%, BYVAL Adr%, NumBytes&)
DECLARE SUB FGetAH (FileName$, SEG Address AS ANY, ElSize%, NumEls%)
DECLARE SUB FGetRT (Handle%, Dest AS ANY, RecNo&, RecSize%)
DECLARE SUB FOpen (FileName$, Handle%)
DECLARE SUB FOpenAll (FileName$, AccessMode%, ShareMode%, Handle%)
DECLARE SUB FSeek (Handle%, Location&)
DECLARE SUB InitInt (SEG Address%, StartValue%, NumEls%)
DECLARE SUB ISortS (SEG Element!, SEG IndexElement%, NumElements%, Direction%)
DECLARE SUB QPrintRC (Work$, Row%, Col%, Colr%)
DECLARE SUB SortT (SEG Array AS ANY, NumEls%, Direction%, ElSize%, MemOffset%, MemSize%)

DECLARE SUB Config ()
DECLARE SUB DispMsg (Msg$, r%, c%)
DECLARE SUB ReleaseEMS ()
DECLARE SUB WindMgr (ULrow%, ULcol%, LRrow%, LRcol%, Frame%, BoxColr%, TextColr%, Text$)
DECLARE FUNCTION EmsAlloc% (NumPages, Handle%, FileName$)
DECLARE FUNCTION LoadIntoEMS% (File$)

STACK 8192

FUNCTION FileDate$ (FInfo AS FileInfoType) STATIC

Year$ = MID$(STR$(FInfo.Year), 2)

Month$ = MID$(STR$(FInfo.Month), 2)
IF LEN(Month$) = 1 THEN Month$ = "0" + Month$

Day$ = MID$(STR$(FInfo.Day), 2)
IF LEN(Day$) = 1 THEN Day$ = "0" + Day$

Hour$ = MID$(STR$(FInfo.Hour), 2)
IF LEN(Hour$) = 1 THEN Hour$ = "0" + Hour$

Minute$ = MID$(STR$(FInfo.Minute), 2)
IF LEN(Minute$) = 1 THEN Minute$ = "0" + Minute$

Second$ = MID$(STR$(FInfo.Second), 2)
IF LEN(Second$) = 1 THEN Second$ = "0" + Second$

FileDate$ = Year$ + Month$ + Day$ + Hour$ + Minute$ + Second$

END FUNCTION
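FileDate$ builds a zero-padded, most-significant-first timestamp string so that two file dates can be compared with an ordinary string comparison. A small Python sketch of the same idea, assuming a four-digit year; `file_date` is an illustrative name:

```python
def file_date(year, month, day, hour, minute, second):
    """Return a fixed-width timestamp string that sorts chronologically."""
    return f"{year}{month:02d}{day:02d}{hour:02d}{minute:02d}{second:02d}"


def is_outdated(count_stamp, key_index_stamp):
    """True when the count file is older than the key index it was derived from."""
    return count_stamp < key_index_stamp
```

The zero padding is what makes the comparison valid: without it, "9" would sort after "10".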

FUNCTION Unique$ (Path$)

IF LEN(Path$) AND RIGHT$(Path$, 1) <> "\" THEN Path$ = Path$ + "\"
Seed& = ABS(Peek2%(0, &H46C))                  'use the TIMER as a seed

DO
   TempName$ = Path$ + MID$(STR$(Seed&), 2)    'make a string out of it
   Seed& = Seed& + 1                           'increment for next time
LOOP UNTIL NOT Exist%(TempName$)               'loop and try another name

Unique$ = TempName$                            'this is the function output

END FUNCTION
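Unique$ probes successive numeric names until one is not already taken. The sketch below shows that loop in Python; `unique_name` is a hypothetical helper, the seed is passed in rather than read from the BIOS timer, and the existence check is injectable so the logic can be exercised without touching a disk.

```python
import os


def unique_name(path, seed, exists=os.path.exists):
    """Return path + a number, starting at seed, that does not exist yet."""
    if path and not path.endswith("\\"):
        path += "\\"                 # the listing appends a DOS path separator
    while True:
        candidate = path + str(seed)
        seed += 1                    # try the next number on the following pass
        if not exists(candidate):
            return candidate
```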

FUNCTION VMSAlloc (NumBytes&, Handle%) STATIC

'----- create a temporary but unique filename
TempPath$ = ENVIRON$("TEMP")
TempFile$ = Unique$(TempPath$) + ".VMS"

'----- create the file
FCreate TempFile$
IF DOSError% THEN
   VMSAlloc = FALSE
   VMSError% = 155
   EXIT FUNCTION
END IF

'----- open the file
FOpen TempFile$, Handle%
IF DOSError% THEN
   VMSAlloc = FALSE
   VMSError% = 133
   KILL TempFile$
   EXIT FUNCTION
END IF

'----- allocate disk space: seek, write something, close, reopen
QPrintRC "Allocating Space...", 24, 20, -1
FSeek Handle%, NumBytes& - 1

x$ = "X"

FPut Handle%, x$
FClose Handle%

QPrintRC "                   ", 24, 20, -1

FOpen TempFile$, Handle%
IF DOSError% THEN
   VMSAlloc = FALSE
   VMSError% = 136
   FClose Handle%
   KILL TempFile$
   EXIT FUNCTION
END IF

gVMSNumAllocations = gVMSNumAllocations + 1

REDIM gVMSTable(1 TO gVMSNumAllocations) AS VMSTableType

gVMSTable(gVMSNumAllocations).Handle = Handle%
gVMSTable(gVMSNumAllocations).TempFile = TempFile$

VMSAlloc = TRUE

END FUNCTION



FUNCTION VMSRelease (Handle%) STATIC

IF Handle% = 0 THEN
   VMSRelease = FALSE
   EXIT FUNCTION
END IF

'----- close the file
FClose Handle%

'----- get the filename from the VMSTable
Found = 0

FOR i = 1 TO gVMSNumAllocations
   IF gVMSTable(i).Handle = Handle% THEN
      Found = i
      EXIT FOR
   END IF
NEXT

'----- found it, so get the filename, delete the file and erase the entry
IF Found THEN

'$$$keyword-flag$$$ "Last Author: %w Revision: %v Date: %f"

CONST Version$ = "Last Author: TED Revision: 11 Date: 11-Aug-92,14:10:06"

CALL Config

SpcLOC = INSTR(2, COMMAND$, " ")
IF SpcLOC = 0 THEN
   Start = 1
ELSE
   Cmd$ = MID$(COMMAND$, SpcLOC)
   Start = VAL(Cmd$)
END IF

'----- KYINVERTED

'----- Load the KYInverted index Data File

File$ = NdxDir$ + "KYINVERT.NDX"
IF NOT Exist%(File$) THEN
   CLS
   PRINT File$; " not found."
   CALL ReleaseEMS
   STOP
END IF

DIM NdxTemp AS NdxType
NdxLEN = LEN(NdxTemp)

TotKeys = FileSize&(File$) \ NdxLEN

REDIM InvertNdx(1 TO TotKeys) AS NdxType

CALL FGetAH(File$, SEG InvertNdx(1), NdxLEN, TotKeys)

'----- transfer it to EMS
CALL Array2EMS(SEG InvertNdx(1), NdxLEN, TotKeys, KYInvertNdxEMS)
IF EmsError% THEN
   PRINT "EMS Error"; EmsError%; "occurred."
   STOP
END IF

'----- erase array: it's in EMS
ERASE InvertNdx

'----- open KYInvert.DAT for random access

DIM RecTEMP AS KeyInfoLONG
RecLEN = LEN(RecTEMP)
File$ = NdxDir$ + "KYINVERT.DAT"
IF NOT Exist%(File$) THEN
   CLS
   Chime 2
   PRINT File$; " not found."
   CALL ReleaseEMS
   STOP
END IF

FOpenAll File$, ACCESSREAD, SHAREDENYWRITE, KYInvertDatFILE

'----- load WEIGHT.NDX into EMS if enough room, else open for random access

File$ = NdxDir$ + "WEIGHT.NDX"
IF NOT Exist%(File$) THEN
   CLS
   PRINT File$; " not found."
   CALL ReleaseEMS
   STOP
END IF

DIM WeightTEMP AS WeightNdx127
WeightLEN = LEN(WeightTEMP)
WeightEMS = LoadIntoEMS(File$)

IF WeightEMS = 0 THEN      '----- not enough EMS to load it in, so just open it
   FOpenAll File$, ACCESSREAD, SHAREDENYWRITE, WeightFILE
END IF

'----- load KEY.NDX into EMS if enough room, else open for random access

File$ = NdxDir$ + "KEY.NDX"
IF NOT Exist%(File$) THEN
   CLS
   PRINT File$; " not found."
   CALL ReleaseEMS
   STOP
END IF

DIM KeyTEMP AS KeyNdx127
KeyLEN = LEN(KeyTEMP)
KeyEMS = LoadIntoEMS(File$)

IF KeyEMS = 0 THEN         '----- not enough EMS to load it in, so just open it
   FOpenAll File$, ACCESSREAD, SHAREDENYWRITE, KeyFILE
END IF

DIM Freq127 AS FreqComp127
DIM BlankFreq AS FreqComp127
Freq127LEN = LEN(Freq127)

'----- Open the FreqComp.127 file for writing (random access)
FreqFile$ = NdxDir$ + "FREQCOMP.127"

IF Exist%(FreqFile$) AND Start = 1 THEN KILL FreqFile$

OPEN FreqFile$ FOR RANDOM SHARED AS #1 LEN = Freq127LEN

'----- Index() is used for the pointer-based integer sort
REDIM Index(0 TO TotKeys - 1) AS INTEGER

FS$ = "FREQCOMP Started At Record" + STR$(Start) + " On " + DATE$ + " " + TIME$
QPrintRC FS$, 1, (80 - LEN(FS$)) \ 2, -1
QPrintRC Version$, 2, 7, -1

CALL QPrintRC("Processing 0 out of" + STR$(TotKeys), 10, 10, -1)
FOR Word = Start TO TotKeys

   CALL QPrintRC(STR$(Word), 10, 20, -1)

   EmsGet1El NdxTemp, NdxLEN, Word, KYInvertNdxEMS

   IF NdxTemp.Num THEN

      '----- clear the Value() for each Word
      REDIM Value(0 TO TotKeys) AS SINGLE

      FOR InvIndex& = NdxTemp.Index TO NdxTemp.Index + NdxTemp.Num - 1

         '----- get the document # pointed to by KYINVERT.NDX .Index
         '      from KYINVERT.DAT
         CALL FGetRT(KYInvertDatFILE, RecTEMP, InvIndex&, RecLEN)

         '----- get the key codes and number for this record
         IF KeyEMS THEN
            EmsGet KeyTEMP, KeyLEN, RecTEMP.Rec, KeyEMS
         ELSE
            FGetRT KeyFILE, KeyTEMP, RecTEMP.Rec, KeyLEN
         END IF

         '----- get the Weight values for this record
         IF WeightEMS THEN
            EmsGet WeightTEMP, WeightLEN, RecTEMP.Rec, WeightEMS
         ELSE
            FGetRT WeightFILE, WeightTEMP, RecTEMP.Rec, WeightLEN
         END IF

         FOR Keyword = 1 TO KeyTEMP.Num

            Code = KeyTEMP.Code(Keyword)

            '----- add weight to Code's total value sum
            IF WeightTEMP.Weight(Keyword) = 0 THEN STOP
            Value(Code) = Value(Code) + WeightTEMP.Weight(Keyword)

         NEXT     'keyword in this record

      NEXT        'next record that Word appears in

      Freq127.Code = Word

      '----- scale the values to the Freq127 word itself
      '      use the code number to obtain its value
      ScaleValue! = Value(Freq127.Code)

      '----- set the value to zero so it won't come up after sorting
      Value(Freq127.Code) = 0

      '----- pointer sort in decreasing order
      CALL InitInt(SEG Index(0), 1, TotKeys)
      CALL ISortS(SEG Value(0), SEG Index(0), TotKeys, DESCEND)

      FOR i = 1 TO 127          'max number of Freq Comps

         Value! = Value(Index(i - 1))

         IF Value! > 0 THEN     'store it
            Freq127.Comp(i) = Index(i - 1)
            V% = Value! / ScaleValue! * 100
            IF V% > 255 THEN V% = 255
            Freq127.Value(i) = CHR$(V%)
         ELSE
            EXIT FOR
         END IF

      NEXT

      Freq127.Num = i - 1       '127 if loop completed, else where it stopped
      PUT #1, Word, Freq127

   ELSE

      PUT #1, Word, BlankFreq

   END IF

   IF INKEY$ = CHR$(27) THEN
      Chime 5
      PRINT "Press ENTER to abort."
      DO
         i$ = INKEY$
      LOOP UNTIL LEN(i$)
      IF i$ = CHR$(13) THEN EXIT FOR
      CLS
   END IF

NEXT

CLOSE #1      'close the freq. comp 127 list

CALL ReleaseEMS
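The main loop above accumulates, for one word, the weights of every keyword that appears in the same documents, normalizes them by the word's own accumulated weight, and keeps the strongest 127. The Python sketch below restates that computation under simplifying assumptions: `frequent_companions` is a hypothetical function, the inverted index and per-document keyword records are plain in-memory structures, and a standard sort replaces the pointer sort.

```python
def frequent_companions(word, postings, doc_keywords, max_comps=127):
    """Return [(companion_code, scaled_value)] for word, strongest first.

    postings: document numbers in which word occurs.
    doc_keywords: dict mapping document number -> list of (code, weight).
    scaled_value is 100 * companion total / word total, capped at 255.
    """
    totals = {}
    for doc in postings:
        for code, weight in doc_keywords[doc]:
            totals[code] = totals.get(code, 0.0) + weight
    scale = totals.pop(word, 0.0)   # the word's own total; removed so it is not its own companion
    if scale <= 0:
        return []
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    result = []
    for code, value in ranked[:max_comps]:
        if value <= 0:
            break
        result.append((code, min(255, round(value / scale * 100))))
    return result
```

Capping at 255 matches storing each value in a single byte, as the one-character Value field of the record type suggests.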

SUB Config STATIC

IF COMMAND$ <> "" THEN

   ConfigFile$ = COMMAND$
   SpcLoc = INSTR(ConfigFile$, " ")
   IF SpcLoc THEN ConfigFile$ = LEFT$(ConfigFile$, SpcLoc - 1)
   ConfigFile$ = ConfigFile$ + ".CFG"
   PRINT "Reading "; ConfigFile$

ELSE

   Chime 5
   PRINT "No configuration file was given."
   PRINT "Press any key to continue."
   DO: LOOP UNTIL LEN(INKEY$)

END IF

IF NOT Exist%(ConfigFile$) THEN
   Chime 10
   PRINT "File "; ConfigFile$; " was not found."
   PRINT "Press any key to continue."
   DO: LOOP UNTIL LEN(INKEY$)
END IF

OPEN ConfigFile$ FOR INPUT AS #1
INPUT #1, Fg, Bg, Brdr, LstDir$, DocDir$, NdxDir$, AbstrDir$, Lang$, x$
CLOSE #1

COLOR Fg, Bg, Brdr

NormAttr = OneColor%(Fg, Bg)
RevAttr = OneColor%(Bg, Fg AND 7)

SixteenK = 16384
ThirtyTwoK& = SixteenK * 2&

CLS

IF NOT EmsLoaded% THEN
   Chime 8
   PRINT "No EMS loaded."
   STOP
END IF

FALSE = 0
TRUE = NOT FALSE

ASCEND = 0
DESCEND = NOT ASCEND

END SUB

SUB DispMsg (Msg$, r, c) STATIC

STATIC WindOpen, Scr%()     'is there already a message displayed?
SHARED Fg, Bg

IF LEN(Msg$) = 0 AND WindOpen THEN GOSUB MsgClose: EXIT SUB
IF WindOpen THEN

   CALL Chime(9)

   OPEN "DEBUG" FOR OUTPUT AS #9
   PRINT #9, "WindOpen="; WindOpen; HEX$(WindOpen)
   PRINT #9, "Msg$= |"; Msg$; "|"
   PRINT #9, "TRUE="; TRUE; HEX$(TRUE)
   PRINT #9, "FALSE="; FALSE; HEX$(FALSE)
   CLOSE #9

   CALL Chime(8)

   CLS

   WHILE LEN(INKEY$) = 0: WEND
   GOSUB MsgClose

END IF

Wid = LEN(Msg$)
IF Wid > 50 THEN Wid = 50

Msg$ = Msg$ + " "           'make sure there's a space to find at the end (see below)

MaxLin = LEN(Msg$) \ Wid + 3
IF MaxLin > 23 THEN MaxLin = 23
REDIM Text$(MaxLin)
Lin = 0
DO
   Lin = Lin + 1                             'incr current lin # (also element in text display array)
   LastSpc = QInstrB%(Wid + 1, Msg$, " ")    'look for the last space so we can word wrap
   Text$(Lin) = LEFT$(Msg$, LastSpc - 1)
   Msg$ = MID$(Msg$, LastSpc + 1)            'remove portion of string that's in Text$
LOOP WHILE LEN(Msg$) > Wid

Msg$ = RTRIM$(Msg$)
IF LEN(Msg$) THEN
   Lin = Lin + 1
   Text$(Lin) = Msg$
END IF

IF r <> 0 AND c = 0 THEN
   ULr = r
ELSE
   ULr = 9
END IF

DULr = ULr - 1
LRr = ULr + Lin + 1
DLRr = LRr + 2

horizmargin = (80 - Wid) \ 2
ULc = horizmargin
DULc = ULc - 3
LRc = 80 - horizmargin
IF Wid / 2 = Wid \ 2 THEN LRc = LRc + 1      'if it's an even width, then bump the column
DLRc = LRc + 1

REDIM Scr%(ArraySize%(DULr, DULc, DLRr, DLRc))

CALL ScrnSave0(DULr, DULc, DLRr, DLRc, SEG Scr%(0))

CALL WindMgr(ULr, ULc, LRr, LRc, 4, NormAttr, RevAttr, "Status")

FOR i = 1 TO Lin
   CALL QPrintRC(Text$(i), ULr + i, ULc + 1, -1)
NEXT

r = ULr + Lin
c = ULc + 1 + LEN(Text$(Lin))

IF LEN(Text$(Lin)) + 2 = Wid THEN c = ULc + 1: r = r + 1

ERASE Text$
WindOpen = TRUE

EXIT SUB

MsgClose:      '----- close window

CALL ScrnRest0(DULr, DULc, DLRr, DLRc, SEG Scr%(0))

ERASE Scr%
WindOpen = FALSE

END SUB
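The word-wrap loop in DispMsg repeatedly cuts the message at the last space that falls within the window width. The Python sketch below mirrors that loop, assuming QInstrB% searches backward from the given position; `wrap_message` is an illustrative name, and the hard break for a word longer than the width is an addition that the listing does not contain.

```python
def wrap_message(msg, max_width=50):
    """Split msg into lines no wider than max_width, breaking at spaces."""
    width = min(len(msg), max_width)
    msg += " "                       # guarantees a space to find at the end
    lines = []
    while True:
        # last space at or before 0-based index width, like the backward search
        cut = msg.rfind(" ", 0, width + 1)
        if cut < 0:                  # no space in range: hard break (not in the listing)
            lines.append(msg[:width])
            msg = msg[width:]
        else:
            lines.append(msg[:cut])
            msg = msg[cut + 1:]
        if len(msg) <= width:
            break
    msg = msg.rstrip()
    if msg:
        lines.append(msg)
    return lines
```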

FUNCTION EmsAlloc% (NumPgs%, Handle%, FileName$) STATIC

IF EmsPagesFree% < NumPgs% THEN
   EmsAlloc = FALSE
ELSE
   EmsAllocMem NumPgs%, Handle%
   EmsAlloc = TRUE
END IF

END FUNCTION

FUNCTION LoadIntoEMS (File$) STATIC

'----- Returns the handle where the file was loaded into

EMSPg = EmsGetPFSeg%

SizeofFile& = FileSize&(File$)

NumPages = SizeofFile& \ SixteenK + 2      'round off to nearest 2 pages
IF NOT EmsAlloc(NumPages, FileEMS, File$) THEN
   LoadIntoEMS = 0
   EXIT FUNCTION
END IF

Num32kBlocks = SizeofFile& \ ThirtyTwoK&

LeftOver& = SizeofFile& - (Num32kBlocks * ThirtyTwoK&)

FOpen File$, LoadFile

FOR i = 1 TO Num32kBlocks + 1

   QPrintRC "Loading " + File$ + " block" + STR$(i) + " /" + STR$(Num32kBlocks + 1) + "

   '----- map pages of the EMS memory to the EMS upper mem page frame
   FOR j = 1 TO 2
      EmsMapMem FileEMS, j, (i - 1) * 2 + j
      IF EmsError% THEN PRINT "Ems error:"; EmsError%: STOP
   NEXT

   '----- seek to beginning of current block
   FSeek LoadFile, (i - 1) * ThirtyTwoK&
   IF DosError% THEN PRINT "Dos Error:"; WhichError%: STOP

   IF i < Num32kBlocks + 1 THEN

      '----- get the 32k block and put it directly into the EMS page frame
      FGetA LoadFile, BYVAL EMSPg, BYVAL 0, ThirtyTwoK&
      IF DosError% THEN PRINT "Dos Error:"; WhichError%: STOP

   ELSE

      '----- load the left over (<32k) bytes
      FGetA LoadFile, BYVAL EMSPg, BYVAL 0, LeftOver&
      IF DosError% THEN PRINT "Dos Error:"; WhichError%: STOP

   END IF
NEXT

FClose LoadFile
LoadIntoEMS = FileEMS
END FUNCTION

SUB ReleaseEMS STATIC

IF KeyEMS THEN
   EmsRelMem KeyEMS
ELSEIF KeyFILE THEN
   FClose KeyFILE
END IF

IF KYInvertNdxEMS THEN EmsRelMem KYInvertNdxEMS

IF KYInvertDatEMS THEN EmsRelMem KYInvertDatEMS

IF WeightEMS THEN
   EmsRelMem WeightEMS
ELSEIF WeightFILE THEN
   FClose WeightFILE
END IF
END SUB

SUB WindMgr (ULrow, ULcol, LRrow, LRcol, Frame, BoxColr, TextColr, Text$) STATIC

CALL Box0(ULrow - 1, ULcol - 1, LRrow + 1, LRcol + 1, Frame, BoxColr)

CALL ClearScr0(ULrow, ULcol, LRrow, LRcol, BoxColr)

CALL QPrintRC("[" + Text$ + "]", ULrow - 1, ULcol + 1, TextColr)

END SUB

DEFINT A-Z 

' RELATIVE.BAS

'----- are there any FreqComps for this keyword?
'----- found the word in j's FCList
'----- look for the word itself in word j's FCList
'----- apply formula of (Lower * 6 + Higher) / 7
'----- sort in decreasing order, by value
'----- save first 63 as the relatives

'Invoked: RELATIVE ConfigFile

'Creates: Relative.63
'   nth Record of Relative.63 contains nth Code,
'   Num of relatives of this code &
'   63 paired values: Companion Codes (Comp()) and 100 * Weight (Value())

'Uses: FreqComp.127, Dict.Wrd (for NumKeys keyword count)

'  TYPE FreqComp63
'     Num AS INTEGER
'     Code AS INTEGER
'     Comp(1 TO 63) AS INTEGER
'     Value(1 TO 63) AS STRING * 1

'$INCLUDE: '\\vadim\c-drive\user\include\types.bi'

'----- Color Attributes
COMMON SHARED Fg, Bg, Brdr, NormAttr, RevAttr

'----- Directories
COMMON SHARED LstDir$, DocDir$, KeyDir$, NdxDir$

'$INCLUDE: '\\vadim\c-drive\user\include\const.bi'

'----- Internal SUBs

DECLARE SUB Config ()
DECLARE SUB EmsAlloc (NumPgs%, Handle%)

'----- External SUBs

DECLARE SUB Chime (Number)
DECLARE SUB EmsAllocMem (NumPages%, Handle%)
DECLARE SUB EmsMapMem (Handle%, PhysicalPage%, LogicalPage%)
DECLARE SUB EmsGet (SEG Value AS ANY, ElSize%, ElNum&, Handle%)
DECLARE SUB EmsSet (SEG Value AS ANY, ElSize%, ElNum&, Handle%)
DECLARE SUB EmsRelMem (Handle%)
DECLARE SUB FClose (Handle%)
DECLARE SUB FCreate (FileName$)
DECLARE SUB FGetA (Handle%, BYVAL SegAdr%, BYVAL Adr%, NumBytes&)
DECLARE SUB FGetAH (FileName$, SEG Element AS ANY, ElSize%, NumEls%)
DECLARE SUB FGetRT (Handle%, Destination AS ANY, RecNumber&, RecSize%)
DECLARE SUB FOpen (FileName$, Handle%)
DECLARE SUB FPutT (Handle%, Source AS ANY, RecSize%)
DECLARE SUB FPutRT (Handle%, Source AS ANY, RecNumber&, RecSize%)
DECLARE SUB FSeek (Handle%, Location&)
DECLARE SUB QPrintRC (St$, Row%, Col%, Colr%)
DECLARE SUB SortT (SEG Address AS ANY, NumEls%, Dir%, ElSize%, MemOffset%, MemSize%)

'----- External FUNCTIONS

DECLARE FUNCTION EmsLoaded%
DECLARE FUNCTION EmsError%
DECLARE FUNCTION EmsPagesFree%
DECLARE FUNCTION EmsGetPFSeg%
DECLARE FUNCTION Exist% (FileName$)
DECLARE FUNCTION OneColor% (Fore%, Back%)
DECLARE FUNCTION FileSize& (FileName$)
DECLARE FUNCTION WhichError%
DECLARE FUNCTION DosError%

DECLARE FUNCTION MaxInt% (a%, b%)
DECLARE FUNCTION MinInt% (a%, b%)

'----- PROGRAM START

CALL Config

DIM NonMutualCount AS LONG, MutualCount AS LONG
Threshold = 0
ThirtyTwoK& = 32 * 1024&
EmsPg = EmsGetPFSeg%

'----- determine number of dictionary entries
File$ = LstDir$ + "DICT.WRD"

DIM DictEntry AS DictType

NumKeys = FileSize&(File$) \ LEN(DictEntry)

DIM Freq127 AS FreqComp127, Freq127obj AS FreqComp127
Freq127LEN = LEN(Freq127)

File$ = NdxDir$ + "FREQCOMP.127"
SizeofFile& = FileSize&(File$)
NumFreqComp = SizeofFile& \ Freq127LEN

NumPages = SizeofFile& \ 16384 + 2         'round off to nearest 2 pages
EmsAlloc NumPages, FreqCompEMS
IF EmsError% THEN PRINT "Ems error:"; EmsError%

Num32kBlocks = SizeofFile& \ ThirtyTwoK&

LeftOver& = SizeofFile& - (Num32kBlocks * ThirtyTwoK&)

FOpen File$, FreqCompFILE
FOR i = 1 TO Num32kBlocks + 1

   QPrintRC "Loading block" + STR$(i) + " /" + STR$(Num32kBlocks + 1), 3, 1, -1

   '----- map pages of the FreqCompEMS memory to the EMS upper mem page frame
   FOR j = 1 TO 2
      EmsMapMem FreqCompEMS, j, (i - 1) * 2 + j
      IF EmsError% THEN PRINT "Ems error:"; EmsError%: STOP
   NEXT

   '----- seek to beginning of current block
   FSeek FreqCompFILE, (i - 1) * ThirtyTwoK&
   IF DosError% THEN PRINT "Dos Error:"; WhichError%

   IF i < Num32kBlocks + 1 THEN

      '----- get the 32k block and put it directly into the EMS page frame
      FGetA FreqCompFILE, BYVAL EmsPg, BYVAL 0, ThirtyTwoK&
      IF DosError% THEN PRINT "Dos Error:"; WhichError%

   ELSE

      '----- load the left over (<32k) bytes
      FGetA FreqCompFILE, BYVAL EmsPg, BYVAL 0, LeftOver&
      IF DosError% THEN PRINT "Dos Error:"; WhichError%

   END IF

NEXT

FClose FreqCompFILE
CLS

DIM Rel AS FreqComp63, BlankRel AS FreqComp63
RelLEN = LEN(Rel)

' Initialize the Blank Relative structure
BlankRel.Num = 0: BlankRel.Code = 0
FOR i = 1 TO 63
    BlankRel.Comp(i) = 0
    BlankRel.Value(i) = CHR$(0)
NEXT

File$ = NdxDir$ + "RELATIVE.63"
FCreate File$
FOpen File$, RelativeFile

CLS
QPrintRC "Processing Keyword 0 out of " + STR$(NumKeys), 10, 10, -1

FOR i = 1 TO NumFreqComp

    EmsGet Freq127, Freq127LEN, CLNG(i), FreqCompEMS

    ' fill this rec with 0's in case there are no FreqComps
    ' also to clear the variable in case there aren't 63 Relatives
    Rel = BlankRel
    Rel.Code = Freq127.Code

    QPrintRC STR$(i), 10, 29, -1

    ' -- are there any FreqComps for this keyword?
    IF Freq127.Num THEN        ' yes, so process it

        WordsInUse = WordsInUse + 1

        REDIM RelVal(1 TO Freq127.Num) AS CodeValue
        FOR j = 1 TO Freq127.Num

            RelVal(j).Code = Freq127.Comp(j)
            RelVal(j).Value = ASC(Freq127.Value(j))

            EmsGet Freq127obj, Freq127LEN, CLNG(Freq127.Comp(j)), FreqCompEMS
            Found = FALSE

            ' -- look for the word itself in word j's FCList
            FOR k = 1 TO Freq127obj.Num
                IF Freq127obj.Comp(k) = Freq127.Code THEN
                    ' -- we found the word in j's FCList
                    '    apply formula of Lower * 6 + Higher
                    ObjValue = ASC(Freq127obj.Value(k))
                    Max = MaxInt%(RelVal(j).Value, ObjValue)
                    Min = MinInt%(RelVal(j).Value, ObjValue)
                    RelVal(j).Value = (Max + (Min * 6)) / 7
                    Found = TRUE
                    EXIT FOR
                END IF
            NEXT

            IF Found THEN      ' mutual
                MutualCount = MutualCount + 1
            ELSE               ' non-mutual
                NonMutualCount = NonMutualCount + 1
                RelVal(j).Value = RelVal(j).Value / 8
            END IF

        NEXT

        ' sort in decreasing order, by value
        SortT RelVal(1), Freq127.Num, DESCEND, LEN(RelVal(1)), LEN(RelVal(1).Code),

        ' save first 63 as the relatives
        MaxRel = 63
        IF MaxRel > Freq127.Num THEN MaxRel = Freq127.Num
        FOR j = 1 TO MaxRel
            IF RelVal(j).Value >= Threshold THEN
                Rel.Comp(j) = RelVal(j).Code
                Rel.Value(j) = CHR$(RelVal(j).Value)
                Rel.Num = Rel.Num + 1
            ELSE
                EXIT FOR
            END IF
        NEXT

    END IF

    ' store this record on disk
    FPutRT RelativeFile, Rel, CLNG(i), RelLEN
    IF INKEY$ = CHR$(27) THEN Chime 6: EXIT FOR

NEXT

' close output file
FClose RelativeFile

' release FreqComp use of EMS
EmsRelMem FreqCompEMS

OPEN "MUTUAL.DAT" FOR OUTPUT AS 1
PRINT #1, "Number of Mutual: "; MutualCount
PRINT #1, "Number of Non-Mutual: "; NonMutualCount
PRINT #1, "Number of words in-use:"; WordsInUse
CLOSE #1

SUB Config STATIC

    IF COMMAND$ <> "" THEN
        ConfigFile$ = COMMAND$ + ".CFG"
    ELSE
        PRINT "No configuration file was given."
        CALL Chime(6)
        DO: LOOP UNTIL LEN(INKEY$)
        END
    END IF

    IF Exist%(ConfigFile$) THEN
        OPEN ConfigFile$ FOR INPUT AS #1
        INPUT #1, Fg, Bg, Brdr, LstDir$, DocDir$, NdxDir$, AbstrDir$, Lang$
        CLOSE #1
    ELSE
        CALL Chime(6)
        PRINT "Configuration file "; ConfigFile$; " does not exist."
        PRINT "Press any key to continue."
        DO: LOOP UNTIL LEN(INKEY$)
        END
    END IF

    COLOR Fg, Bg, Brdr

    NormAttr = OneColor%(Fg, Bg)
    RevAttr = OneColor%(Bg, Fg AND 7)

    CLS
    QPrintRC "RELATIVE", 1, 36, -1

    IF NOT EmsLoaded% THEN
        PRINT "No EMS!"
        CALL Chime(6)
        DO: LOOP UNTIL LEN(INKEY$)
        END
    END IF

END SUB

SUB EmsAlloc (NumPgs%, Handle%) STATIC

    IF EmsPagesFree% < NumPgs THEN
        CALL Chime(8)
        PRINT "Couldn't Allocate "; CLNG(NumPgs) * SixteenK; " bytes of EMS ("; NumPgs; " pages)."
        PRINT "Only "; EmsPagesFree%; " pages were available."
        STOP
    ELSE
        EmsAllocMem NumPgs%, Handle%
    END IF

END SUB
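
The weighting step in the main loop of the program above can be sketched outside QuickBASIC as follows. This is a minimal illustration, not the listing's implementation: the function name `merge_relatives`, the dictionary `freq_lists` (standing in for the FREQCOMP.127 records held in EMS) and the use of floor division are assumptions, and each code is assumed to appear at most once per list.

```python
def merge_relatives(code, freq_lists, threshold=0, max_rel=63):
    """Return up to max_rel (code, value) relatives for one word.

    freq_lists maps a word code to its list of (co-occurring code, value)
    pairs. Hypothetical stand-in for the records read with EmsGet.
    """
    scored = []
    for other, value in freq_lists.get(code, []):
        # Look for this word in the other word's own list.
        back = dict(freq_lists.get(other, [])).get(code)
        if back is not None:
            # Mutual: the lower value counts six times, the higher once.
            hi, lo = max(value, back), min(value, back)
            value = (hi + lo * 6) // 7
        else:
            # Non-mutual: discount the one-sided value to an eighth.
            value = value // 8
        scored.append((other, value))
    # Sort in decreasing order by value, then keep the strongest entries
    # that are still at or above the threshold.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(c, v) for c, v in scored[:max_rel] if v >= threshold]
```

Because the list is sorted before the cut, filtering by the threshold gives the same result as leaving the loop at the first value that falls below it.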



DEFINT A-Z
'REL-INV.BAS

'Invoked: rel-inv configfile
'Creates: Rel-Inv.Ndx, Rel-Inv.Dat
'Uses: Relative.63 & Dict.Wrd for NumKeys

'nth Rec of Rel-Inv.Ndx contains nth Code, ptr into Rel-Inv.Dat & NumRelsOfCode
' TYPE NdxType
'    Code AS INTEGER
'    Index AS LONG
'    Num AS INTEGER

'pointed to record of Rel-Inv.Dat contains code of the first FreqComp list
'of nth code & the nth code's value in that list
' TYPE RecValue
'    Rec AS INTEGER
'    Value AS STRING * 1

'$INCLUDE: '\\vadim\c-drive\user\include\types.bi'

TYPE LinkHead
    Num AS INTEGER
    FirstPtr AS LONG
    LastPtr AS LONG
    Pad AS STRING * 6
END TYPE

TYPE LinkNode
    ' holds the actual data, in this case the fc record # and its value
    Info AS RecValue
    Ptr AS LONG
END TYPE

' Color Attributes
COMMON SHARED Fg, Bg, Brdr, NormAttr, RevAttr
' Directories
COMMON SHARED LstDir$, DocDir$, KeyDir$, NdxDir$
' Temp variables
COMMON SHARED NodeLEN%, NodeTemp AS LinkNode

'$INCLUDE: '\\vadim\c-drive\user\include\const.bi'

' Internal SUBs
DECLARE SUB AddNode (Head AS ANY, Info AS RecValue, StorageH AS INTEGER, FreePtr AS LONG)
DECLARE SUB Config ()
DECLARE SUB EmsAlloc (NumPgs%, Handle%)

' External SUBs
DECLARE SUB EmsAllocMem (NumPages%, Handle%)
DECLARE SUB EmsGet (SEG Value AS ANY, ElSize%, ElNum&, Handle%)
DECLARE SUB EmsSet (SEG Value AS ANY, ElSize%, ElNum&, Handle%)
DECLARE SUB EmsRelMem (Handle%)
DECLARE SUB FClose (Handle%)
DECLARE SUB FCreate (FileName$)
DECLARE SUB FGetRT (Handle%, Destination AS ANY, RecNumber&, RecSize%)
DECLARE SUB FOpen (FileName$, Handle%)
DECLARE SUB FPutAH (FileName$, SEG Element AS ANY, ElSize%, NumEls%)
DECLARE SUB FPutT (Handle%, Source AS ANY, RecSize%)
DECLARE SUB QPrintRC (St$, Row%, Col%, Colr%)

' External FUNCTIONs



DECLARE FUNCTION I 
DECLARE FUNCTION EmsPagesFree%
DECLARE FUNCTION Exist% (FileName$)
DECLARE FUNCTION OneColor% (Fore%, Back%)
DECLARE FUNCTION FileSize& (FileName$)

' PROGRAM START

CALL Config

LoadFile$ = LstDir$ + "DICT.WRD"
DIM d AS DictType
NumKeys = FileSize&(LoadFile$) \ LEN(d)

DIM Freq63 AS FreqComp63
Freq63LEN = LEN(Freq63)

LoadFile$ = NdxDir$ + "RELATIVE.63"
NumFreqComp = FileSize&(LoadFile$) \ Freq63LEN
FOpen LoadFile$, FreqCompNDX

REDIM LinkList(1 TO NumKeys) AS LinkHead

DIM Info AS RecValue
InfoLEN = LEN(Info)

NumPages = NodeLEN * CLNG(NumFreqComp) * 63 \ 16384 + 1
EmsAlloc NumPages, InvertDatEMS

FreePtr& = 1

CLS

FOR i = 1 TO NumFreqComp

    QPrintRC STR$(i) + " out of " + STR$(NumFreqComp), 10, 1, -1
    FGetRT FreqCompNDX, Freq63, CLNG(i), Freq63LEN
    Info.Rec = Freq63.Code
    FOR j = 1 TO Freq63.Num
        Info.Value = Freq63.Value(j)
        AddNode LinkList(Freq63.Comp(j)), Info, InvertDatEMS, FreePtr&
    NEXT

NEXT

' close input file
FClose FreqCompNDX

' start writing inverted file by traversing the linked list

CLS

REDIM Ndx(1 TO NumKeys) AS NdxType

FCreate NdxDir$ + "REL-INV.DAT"
FOpen NdxDir$ + "REL-INV.DAT", InvertDatFile

FOR i = 1 TO NumKeys

    QPrintRC STR$(i) + " " + STR$(LinkList(i).Num) + "  ", 10, 1, -1

    Ptr& = LinkList(i).FirstPtr

    Ndx(i).Index = Index&
    Ndx(i).Num = LinkList(i).Num

    DO WHILE Ptr&
        EmsGet NodeTemp, NodeLEN, Ptr&, InvertDatEMS
        FPutT InvertDatFile, NodeTemp.Info, InfoLEN
        Ptr& = NodeTemp.Ptr
    LOOP

    Index& = Index& + Ndx(i).Num

NEXT

EmsRelMem InvertDatEMS
FClose InvertDatFile

FPutAH NdxDir$ + "REL-INV.NDX", Ndx(1), LEN(Ndx(1)), NumKeys
DATA " Copyright 1990 by Ted N. Young. ALL RIGHTS RESERVED. "

SUB AddNode (Head AS LinkHead, Info AS RecValue, StorageH AS INTEGER, FreePtr AS LONG) STATIC

    IF Head.FirstPtr = NULL THEN
        Head.FirstPtr = FreePtr
        Head.LastPtr = FreePtr
    ELSE
        CALL EmsGet(NodeTemp, NodeLEN, Head.LastPtr, StorageH)
        NodeTemp.Ptr = FreePtr
        CALL EmsSet(NodeTemp, NodeLEN, Head.LastPtr, StorageH)
        Head.LastPtr = FreePtr
    END IF

    Head.Num = Head.Num + 1
    NodeTemp.Info = Info
    NodeTemp.Ptr = NULL
    CALL EmsSet(NodeTemp, NodeLEN, FreePtr, StorageH)

    FreePtr = FreePtr + 1

END SUB

SUB Config STATIC

    IF COMMAND$ <> "" THEN
        ConfigFile$ = COMMAND$ + ".CFG"
    ELSE
        PRINT "No configuration file was given."
        PRINT "Press any key to continue."
        CALL Chime(6)
        DO: LOOP UNTIL LEN(INKEY$)
        END
    END IF

    IF Exist%(ConfigFile$) THEN
        OPEN ConfigFile$ FOR INPUT AS #1
        INPUT #1, Fg, Bg, Brdr, LstDir$, DocDir$, NdxDir$, AbstrDir$
        CLOSE #1
    ELSE
        CALL Chime(6)
        PRINT "Configuration file "; ConfigFile$; " does not exist."
        PRINT "Press any key to continue."
        DO: LOOP UNTIL LEN(INKEY$)
        END
    END IF

    COLOR Fg, Bg, Brdr

    NormAttr = OneColor%(Fg, Bg)
    RevAttr = OneColor%(Bg, Fg AND 7)

    CLS
    QPrintRC "REL-INV", 1, 36, -1

    IF NOT EmsLoaded% THEN
        PRINT "No EMS!"
        CALL Chime(6)
        DO: LOOP UNTIL LEN(INKEY$)
        END
    END IF

    NodeLEN = LEN(NodeTemp)

END SUB

SUB EmsAlloc (NumPgs%, Handle%) STATIC

    IF EmsPagesFree% < NumPgs THEN
        CALL Chime(8)
        PRINT "Couldn't Allocate "; CLNG(NumPgs) * SixteenK; " bytes of EMS ("; NumPgs; " pages)."
        PRINT "Only "; EmsPagesFree%; " pages were available."
        STOP
    ELSE
        EmsAllocMem NumPgs%, Handle%
    END IF

END SUB
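
The inversion that REL-INV performs (one chain per word, filled while reading the relative lists, then flattened into a data file plus an offset/count index) can be sketched as below. This is an illustration under stated assumptions: `invert_relatives` and its argument layout are hypothetical, Python lists stand in for the EMS-resident linked lists, and offsets are zero-based here because the starting value of `Index&` is not visible in the listing.

```python
def invert_relatives(relative_lists, num_keys):
    """relative_lists: list of (code, [(rel_code, value), ...]) records.

    Returns (index, data): index[k] = (offset, count) for word code k
    (codes are 1-based; entry 0 is unused) and data is the flat list of
    (rec, value) pairs, grouped by the word that appears as a relative.
    """
    chains = [[] for _ in range(num_keys + 1)]
    for code, relatives in relative_lists:
        for rel_code, value in relatives:
            # Append to the chain of the word that appears as a relative.
            chains[rel_code].append((code, value))
    index, data = [None], []
    for k in range(1, num_keys + 1):
        # Record where this word's chain starts and how long it is.
        index.append((len(data), len(chains[k])))
        data.extend(chains[k])
    return index, data
```

Appending at the tail keeps each chain in the order the relative lists were read, which is what the `LastPtr` field of `LinkHead` achieves in the listing.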
DEFINT A-Z

' POLYSEMY.BAS: Builds: PolySemy.Dat & PolyAvg
'               Uses  : Relative.63, Rel-InvS.Ndx, Rel-Inv.Dat

'nth Rec of PolySemy.Dat contains Poly Value of nth word, calculated as follows:

'POLY FORMULA:

• PolySenytVord) » L062< < 1/600) * ( (Avg3 * tAvg3/A,g2D) ) * 1.68 ♦ CAvg6 * <Avg6/Avg63> ) " 1.8) • <S ' CKu»^rcsUhichAr*fitLs/DcesU1 
'New formula 3/28/92 EPS

• PclySeoyCUord) = L0Q2C (1/1000) • < Uvg3 * <Avg3/Ajs20) ) • 2.2 ♦ CAvg6 * (Avg6/Avg63) ) * 2.2) * U " .(NuaUordsWhlchArefiela/Docsui 

Equivalent formula- 

• PolySenyCUord) * LO2CUvg3*tA>/o3/Avg20))^.2*CAvg6»tt^^ 

' PolyAvg is average poly value

'New formula 6/1/92 VN

• PvlySenyWord) oOOG2CCAvg3*<Avg3/Avg20))^.2+(Ave*^

' New formula 6/19/92 TNY
' PolySemy(Word) = SQR(TotalRelValue / DocFreq) x SQR(Avg3 * (Avg3 / Avg20) + Avg6 * (Avg6 / Avg63))
' TotRelVal is the sum of relative values from all relative lists
' in which the word appears

'$INCLUDE: '\\vadim\c-drive\user\include\types.bi'

TYPE PolyType
    Code AS INTEGER
    Value AS SINGLE
    Pad AS STRING * 2      ' needed to make this type a ^2 for huge array
END TYPE

' -- Color Attributes
COMMON SHARED Fg, Bg, Brdr, NormAttr, RevAttr

' -- Common Temp Variables
COMMON SHARED Temp$, i%, j%, k%, l%, x%

' -- Common Constants
COMMON SHARED SixteenK, ThirtyTwoK&

' -- File Directories
COMMON SHARED LstDir$, DocDir$, KeyDir$, NdxDir$
COMMON SHARED DictWordEMS, RelInvDatEMS

'$INCLUDE: '\\vadim\c-drive\user\include\const.bi'

' Internal SUBs
DECLARE SUB Config ()
DECLARE SUB DispMsg (Msg$, r%, c%)
DECLARE SUB EmsAlloc (NumPgs%, Handle%, FileName$)
DECLARE SUB ReleaseEMS ()
DECLARE SUB WindMgr (ULRow%, ULCol%, LRRow%, LRCol%, Frame%, BoxColr%, TextColr%, Text$)

' Internal FUNCTIONs
DECLARE FUNCTION Dict$ (Code%)
DECLARE FUNCTION Freq% (Array() AS NdxType, Tot%, Code%)
DECLARE FUNCTION LoadIntoEMS% (File$)

' External Declares
'$INCLUDE: '\\vadim\c-drive\user\include\declares.bi'
' PROGRAM START

CALL Config

QPrintRC "POLYSEMY", 1, 35, -1

'$$$keyword-flag$$$ "Last Author: %w Revision: %v Date: %f"
CONST Version$ = "Last Author: TED Revision: 17 Date: 21-Aug-92,15:32:18"

Col = (80 - LEN(Version$)) \ 2
QPrintRC Version$, 2, Col, -1

' -- Load in Code-->Word dictionary directly into EMS
LoadFILE$ = LstDir$ + "DICT.WRD"
DictWordEMS = LoadIntoEMS(LoadFILE$)

' -- Load in REL-INV.DAT
LoadFILE$ = NdxDir$ + "REL-INV.DAT"
RelInvDatEMS = LoadIntoEMS(LoadFILE$)

DIM RelInvDat AS RecValue
RelInvDatLEN = LEN(RelInvDat)

' -- keyword inverted file index (which holds the keyword's code and
'    the number of documents in which it appears [DocFreq])
LoadFILE$ = NdxDir$ + "KYINVERT.NDX"
DIM NdxTemp AS NdxType

' -- get the number of keywords
IF Exist%(LoadFILE$) THEN
    NumKey = FileSize&(LoadFILE$) \ LEN(NdxTemp)
ELSE
    PRINT LoadFILE$; " was not found."
    ReleaseEMS
    END
END IF

' -- dimension array to hold the inverted index info
REDIM Ndx(1 TO NumKey) AS NdxType

' -- load the file into the array
CALL FGetAH(LoadFILE$, Ndx(1), LEN(NdxTemp), NumKey)

DIM Rel AS FreqComp63
RelLEN = LEN(Rel)

LoadFILE$ = NdxDir$ + "RELATIVE.63"
NumRel = FileSize&(LoadFILE$) \ RelLEN
CALL FOpen(LoadFILE$, RelativeFile)

'** DIM RelInvNdx AS NdxType
REDIM RelInvNdx(1 TO NumRel) AS SmallNdxType
LoadFILE$ = NdxDir$ + "REL-INVS.NDX"

'** CALL FOpen(LoadFILE$, RelInvNdxFile)

' -- load the file into the array
FGetAH LoadFILE$, RelInvNdx(1), LEN(RelInvNdx(1)), NumRel

CLS

PRINT "Processing"; NumRel; " words..."
QPrintRC "First Pass Polyvalues:", 10, 1, -1
QPrintRC "Second Pass, Final Polyvalues:", 11, 1, -1
QPrintRC "Printing Pass, writing to disk:", 12, 1, -1

REDIM PolySemy(1 TO NumRel) AS PolyType
REDIM PolySemyOrig(1 TO NumRel) AS SINGLE
REDIM TotRelValue(1 TO NumRel) AS INTEGER

FOR i = 1 TO NumRel

    CALL QPrintRC(STR$(i), 10, 23, -1)

    CALL FGetRT(RelativeFile, Rel, CLNG(i), RelLEN)

    PolySemy(i).Code = i
    PolySemy(i).Value = 0
    PolySemyOrig(i) = 0

    IF Rel.Num > 0 THEN        ' calculate it, otherwise skip it
        Sum = 0
        Avg3! = 0: Avg6! = 0: Avg20! = 0: Avg63! = 0
        FOR j = 1 TO 63

            IF j <= Rel.Num THEN
                Code = Rel.Comp(j)
                '** Fq = Ndx(Code).Num      Not used?
                Value = ASC(Rel.Value(j))
            ELSE
                Value = 0
            END IF

            ' running sum gives the average of the top 3, 6, 20 and 63 values
            Sum = Sum + Value
            SELECT CASE j
                CASE 3
                    Avg3! = Sum / 3
                CASE 6
                    Avg6! = Sum / 6
                CASE 20
                    Avg20! = Sum / 20
                CASE 63
                    Avg63! = Sum / 63
                CASE ELSE
            END SELECT

        NEXT

        IF Avg63! > 0 AND Avg20! > 0 THEN
            ' -- SQR is the same as ^ .5
            Poly! = SQR(Avg3! * (Avg3! / Avg20!) + Avg6! * (Avg6! / Avg63!))
        ELSE
            Poly! = 0
        END IF

        PolySemy(i).Value = Poly!
        PolySemyOrig(i) = Poly!

    END IF

    IF INKEY$ = CHR$(27) THEN
        INPUT "Exit?", yn$
        IF UCASE$(LEFT$(yn$, 1)) = "Y" THEN EXIT FOR
    END IF

NEXT

CALL FClose(RelativeFile)



DIM PolyValue AS PolyValueType

OPEN NdxDir$ + "POLYSEMY.DAT" FOR RANDOM AS 1 LEN = LEN(PolyValue)

Log2! = LOG(2)
Log4! = LOG(4)
Log1000! = LOG(1000)

AvgPoly! = 0

FOR i = 1 TO NumRel

    CALL QPrintRC(STR$(i), 11, 31, -1)

    IF PolySemy(i).Value > 0 THEN

        '** FGetRT RelInvNdxFile, RelInvNdx, CLNG(i), LEN(RelInvNdx)
        PolyFreq = Ndx(i).Num
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changed inmT^'^ = P ° lyStt * (i) < VaU * ' CClrtWlta / PolyFreq) " 4) 

i 3/17/92 y w tUC * VaLUe = P0l ^ ea y <i) - Val « / » * « * (teUnmm.Hu. / PolyFreq)) 
        PolyValue.Value = LOG(PolySemy(i).Value)
        IF RelInvNdx(i).Num > 0 THEN
            PolyTemp! = RelInvNdx(i).Num / PolyFreq
        ELSE
            PolyTemp! = 0
        END IF

        PolyValue.Value = PolyValue.Value + Log4! * PolyTemp! - Log1000!
        PolyValue.Value = PolyValue.Value / Log2!
        '** Changed 6/01/92 VN
        '** IF RelInvNdx(i).Num > 0 THEN
        '**     PolyTemp! = (RelInvNdx(i).Num / PolyFreq) * .7
        '** ELSE
        '**     PolyTemp! = .1
        '** END IF

        ' -- new calculation method 8/19/92 - TNY
        ' -- TotRelVal is the sum of relative values from all relative lists
        '    in which the word appears
        TotRelVal& = 0

        FOR j& = RelInvNdx(i).Index TO RelInvNdx(i).Index + RelInvNdx(i).Num - 1
            EmsGet RelInvDat, RelInvDatLEN, j&, RelInvDatEMS
            TotRelVal& = TotRelVal& + ASC(RelInvDat.Value)
        NEXT

        ' keep the low 16 bits so the total fits the INTEGER array
        Temp& = TotRelVal& AND &HFFFF&
        IF Temp& < 32768 THEN
            TotRelValue(i) = Temp&
        ELSE
            TotRelValue(i) = Temp& - 65536
        END IF

        IF TotRelVal& > 0 THEN
            ' changed 8/21/92: PolyFreq is now PolyFreq * 1.2
            PolyTemp! = SQR(TotRelVal& / PolyFreq * 1.2)
        ELSE
            PolyTemp! = .1
        END IF

        PolyValue.Value = PolyTemp! * PolySemy(i).Value
        IF PolyValue.Value < .4 THEN PolyValue.Value = .4
        PolySemy(i).Value = PolyValue.Value
        AvgPoly! = AvgPoly! + PolyValue.Value

    ELSE

        PolyValue.Value = 0

    END IF

    PUT #1, i, PolyValue

NEXT

AvgPoly! = AvgPoly! / NumRel
CLOSE #1

OPEN NdxDir$ + "PolyAvg.dat" FOR OUTPUT AS #1
PRINT #1, AvgPoly!
CLOSE #1



IF INKEY$ = CHR$(27) GOTO PolySemyDone

SortT PolySemy(1), NumRel, DESCEND, LEN(PolySemy(1)), 2, -3

OPEN LstDir$ + "PolySemy.LST" FOR OUTPUT AS 1

PRINT 01 r "New Value Preq 0RelLlst TotRelVal Rel/Freq QrigPoly Code Uord/Phrax*- 

-000.0000000000 00000 m»o m.*Z ^£000 00009 «• 

'PRINT 01, °PolyVolue tfbrd/Phrese* 
• "000.00000 V 

FOR i = 1 TO NumRel

    CALL QPrintRC(STR$(i), 12, 32, -1)

    PolyFreq = Ndx(PolySemy(i).Code).Num

    IF PolySemy(i).Value > 0 AND PolyFreq > 3 THEN

        '** FGetRT RelInvNdxFile, RelInvNdx, CLNG(PolySemy(i).Code), LEN(RelInvNdx)

        'PRINT #1, USING "###.##### &"; PolySemy(i).Value; Dict$(PolySemy(i).Code)

        TotRelVal& = TotRelValue(PolySemy(i).Code)

        IF TotRelVal& < 0 THEN TotRelVal& = TotRelVal& + 65536

PRINT 01, USING "000.00000 00000 ttart* 000 000 Bam ass «hh a-. ,<s ... . 

Code; 0ictS(PolySeay<1).Code) 000R.000 00000 ft , PolySeftyd). value; PolyFreq; Rellnvftdx 

    END IF

NEXT
CLOSE 1
PolySemyDone:
    ELSE

        ULr = 9

    END IF

    DULr = ULr - 1
    LRr = ULr + Lin + 1
    DLRr = LRr + 2

    horizmargin = (80 - Wid) \ 2
    ULc = horizmargin
    DULc = ULc - 3
    LRc = 80 - horizmargin

    IF (Wid \ 2) * 2 = Wid THEN LRc = LRc + 1    ' if it's even width, then bump up the LRcolumn
    DLRc = LRc + 1

    REDIM Scr%(ArraySize%(DULr, DULc, DLRr, DLRc))

    CALL ScrnSave0(DULr, DULc, DLRr, DLRc, SEG Scr%(0))

    CALL WindMgr(ULr, ULc, LRr, LRc, 4, NormAttr, RevAttr, "Status")

    FOR i = 1 TO Lin
        CALL QPrintRC(Text$(i), ULr + i, ULc + 1, -1)
    NEXT

    r = ULr + Lin
    c = ULc + 1 + LEN(Text$(Lin))

    IF LEN(Text$(Lin)) + 2 = Wid THEN c = ULc + 1: r = r + 1

    ERASE Text$
    WindOpen = TRUE

    EXIT SUB

    ' -- close window
MsgClose:

    CALL ScrnRest0(DULr, DULc, DLRr, DLRc, SEG Scr%(0))
    ERASE Scr%
    WindOpen = FALSE
    RETURN

END SUB

SUB EmsAlloc (NumPgs%, Handle%, FileName$) STATIC

    IF EmsPagesFree% < NumPgs THEN
        CALL Chime(8)
        PRINT "Couldn't Allocate "; CLNG(NumPgs) * SixteenK; " bytes of EMS ("; NumPgs; " pages) for "; FileName$
        PRINT "Only "; EmsPagesFree%; " pages were available."
        CALL ReleaseEMS
        STOP
    ELSE
        CALL EmsAllocMem(NumPgs%, Handle%)
    END IF

END SUB

FUNCTION LoadIntoEMS (File$) STATIC

    ' Returns the EMS handle where the file was loaded into

    EmsPg = EmsGetPFSeg%
    SizeofFile& = FileSize&(File$)

    NumPages = SizeofFile& \ SixteenK + 2       ' round off to nearest 2 pages
    EmsAlloc NumPages, FileEMS, File$

    Num32kBlocks = SizeofFile& \ ThirtyTwoK&
    LeftOver& = SizeofFile& - (Num32kBlocks * ThirtyTwoK&)

    FOpenAll File$, 0, 4, LoadFILE

    Box0 14, 8, 18, 72, 2, RevAttr
    PaintBox0 14, 8, 18, 72, RevAttr

    FOR i = 1 TO Num32kBlocks + 1

        QPrintRC "Loading " + File$ + " block" + STR$(i) + " /" + STR$(Num32kBlocks + 1), 16, 10, RevAttr

        ' map pages of the EMS memory to the EMS upper mem page frame
        FOR j = 1 TO 2
            EmsMapMem FileEMS, j, (i - 1) * 2 + j
            IF EmsError% THEN PRINT "Ems error:"; EmsError%: STOP
        NEXT

        ' seek to beginning of current block
        FSeek LoadFILE, (i - 1) * ThirtyTwoK&
        IF DosError% THEN PRINT "Dos Error:"; WhichError%: STOP

        IF i < Num32kBlocks + 1 THEN

            ' get the 32k block and put it directly into the EMS page frame
            FGetA LoadFILE, BYVAL EmsPg, BYVAL 0, ThirtyTwoK&

        ELSE

            IF DosError% THEN PRINT "Dos Error:"; ErrorMsg$(WhichError%): STOP

            ' load the left over (<32k) bytes
            FGetA LoadFILE, BYVAL EmsPg, BYVAL 0, LeftOver&
            IF DosError% THEN PRINT "Dos Error:"; ErrorMsg$(WhichError%): STOP

        END IF

    NEXT

    FClose LoadFILE

    ClearScr0 14, 8, 18, 72, NormAttr

    LoadIntoEMS = FileEMS

END FUNCTION

SUB ReleaseEMS STATIC

    IF DictWordEMS THEN CALL EmsRelMem(DictWordEMS)
    IF RelInvDatEMS THEN CALL EmsRelMem(RelInvDatEMS)

END SUB

SUB WindMgr (ULRow, ULCol, LRRow, LRCol, Frame, BoxColr, TextColr, Text$) STATIC

    CALL Box0(ULRow - 1, ULCol - 1, LRRow + 1, LRCol + 1, Frame, BoxColr)
    CALL ClearScr0(ULRow, ULCol, LRRow, LRCol, BoxColr)
    CALL QPrintRC("[" + Text$ + "]", ULRow - 1, ULCol + 1, TextColr)

END SUB
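
The two passes of the polysemy calculation in the program above can be sketched as follows. This is a hedged illustration rather than the listing's code: the function names and parameters are hypothetical, the 1.2 factor is applied in the order the listing's expression reads from left to right (its accompanying comment could also be read as scaling the document frequency instead), and the 0.1 fallback and 0.4 floor are taken from the second pass as scanned.

```python
import math


def first_pass_poly(values):
    """values: one word's relative values, strongest first (at most 63)."""
    # Missing entries count as zero, as in the FOR j = 1 TO 63 loop.
    padded = (list(values) + [0] * 63)[:63]
    avg = {n: sum(padded[:n]) / n for n in (3, 6, 20, 63)}
    if avg[20] <= 0 or avg[63] <= 0:
        return 0.0
    return math.sqrt(avg[3] * (avg[3] / avg[20]) + avg[6] * (avg[6] / avg[63]))


def final_poly(poly, tot_rel_val, doc_freq, freq_scale=1.2, floor=0.4):
    """Scale a first-pass value by how strongly the word figures in the
    relative lists of other words, relative to its document frequency."""
    if poly <= 0:
        return 0.0
    if tot_rel_val > 0:
        factor = math.sqrt(tot_rel_val / doc_freq * freq_scale)
    else:
        factor = 0.1
    # Any word with a positive first-pass value ends up at or above the floor.
    return max(poly * factor, floor)
```

A word whose top three relatives are far stronger than its top twenty gets a large `Avg3 / Avg20` ratio, so the first pass rewards sharply peaked relative lists.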

DEFINT A-Z

' NEWKEY.BAS

'Invoked: NEWKEY ConfigFile
'Creates: NewKey.Ndx, NewVal.Ndx
'Uses: Key.Ndx, Weight.Ndx, PolySemy.Dat, KyInvrts.Ndx

' -- NewVal.Ndx contains new weights for Doc n sorted by a new weight
'    where NewWeight = weight * (Poly ^ .125)

' TYPE WeightAvgNdx127
'    Weight(1 TO 127) AS SINGLE
'    Mult AS SINGLE

' Mult is used to vary the number of sentences in the abstract program

' NewKey.Ndx contains the corresponding codes for Doc n
' i.e., sorted by their new weights

' TYPE KeyNdx127
'    Num AS INTEGER
'    Code(1 TO 127) AS INTEGER

'$INCLUDE: '\\vadim\c-drive\USER\INCLUDE\TYPES.BI'
'$INCLUDE: '\\vadim\c-drive\USER\INCLUDE\DECLARES.BI'

TYPE ValCodeType
    Value AS SINGLE
    Code AS INTEGER
END TYPE

'TYPE MultType
'    Code AS INTEGER
'    Num AS INTEGER
'END TYPE

CONST FALSE = 0, TRUE = NOT FALSE

DECLARE FUNCTION LoadIntoEMS% (File$, NoEMSFlag%)
DECLARE SUB NewKey ()
DECLARE SUB Config ()
DECLARE SUB EmsAlloc (NumPages%, Handle%, LoadFILE$)
DECLARE SUB LoadData ()
DECLARE SUB ReleaseEMS ()

' COMMON SHARED Variables/Arrays

' Color Attributes
COMMON SHARED Fg, Bg, Brdr, NormAttr, RevAttr

' constants
COMMON SHARED ThirtyTwoK&, SixtyFour, SixteenK, sing, comb, BlankBit$
' Directories
COMMON SHARED NdxDir$, DocDir$, KeyDir$, LstDir$, AbstrDir$

' Data file sizes and EMS handles
COMMON SHARED KeyNdxFILE
COMMON SHARED PolySemyEMS, PolyValue AS PolyValueType, PolyLen
COMMON SHARED KeyEMS, WeightEMS, PolyEMS
COMMON SHARED KeyFile, WeightFile, PolyFile
COMMON SHARED WeightTemp AS WeightNdx127, WeightLen
COMMON SHARED KeyTemp AS KeyNdx127, KeyLen
COMMON SHARED ValArray AS WeightAvgNdx127, ValLen
COMMON SHARED LenTemp, NewKeyFile, NewValueFile, NewValueEMS
COMMON SHARED PolyNoEMSFlag, KeyNoEMSFlag, WeightNoEMSFlag
' -- MAIN



CALL Config

ValLen = LEN(ValArray)
File$ = NdxDir$ + "NewKey.ndx"
FOpenAll File$, 1, 4, NewKeyFile

' -- if the file wasn't there, create it
IF NewKeyFile = -1 THEN
    FCreate File$
    FOpenAll File$, 1, 4, NewKeyFile
END IF

File$ = NdxDir$ + "NewVal.ndx"
FOpenAll File$, 1, 4, NewValueFile
' -- if the file wasn't there, create it
IF NewValueFile = -1 THEN
    FCreate File$
    FOpenAll File$, 1, 4, NewValueFile
END IF

QPrintRC "NEWKEY IS WORKING", 2, 30, -1

CALL NewKey
CALL ReleaseEMS



SUB Config STATIC

    IF COMMAND$ <> "" THEN
        ConfigFile$ = COMMAND$
        SpcLoc = INSTR(ConfigFile$, " ")
        IF SpcLoc THEN ConfigFile$ = LEFT$(ConfigFile$, SpcLoc - 1)
        ConfigFile$ = ConfigFile$ + ".CFG"
    ELSE
        Chime 5
        PRINT "Usage NEWKEY {dataname}."
        STOP
    END IF

    IF NOT Exist%(ConfigFile$) THEN
        PRINT ConfigFile$; " was not found."
        PRINT "Press any key to continue."
        DO: LOOP UNTIL LEN(INKEY$)
        END
    END IF

    OPEN ConfigFile$ FOR INPUT AS #1
    INPUT #1, Fg, Bg, Brdr, LstDir$, DocDir$, NdxDir$, AbstrDir$, Lang$
    CLOSE #1

    NormAttr = OneColor%(Fg, Bg)
    RevAttr = OneColor%(Bg, Fg AND 7)
    SixteenK = 16384
    ThirtyTwoK& = SixteenK * 2&
    CLS

    IF NOT EmsLoaded% THEN
        Chime 8
        PRINT "The EMS has not been loaded."
        STOP
    END IF

END SUB

SUB EmsAlloc (NumPages%, Handle%, LoadFILE$) STATIC

    CALL EmsAllocMem(NumPages%, Handle%)
    IF EmsError% THEN
        ' fall back to reading from disk; a zero handle signals "no EMS"
        PRINT "Couldn't allocate"; CLNG(NumPages) * SixteenK; "bytes of EMS for "; LoadFILE$
        PRINT "Use disk space."
        Handle = 0
    END IF

END SUB



SUB LoadData STATIC

    ' -- Load Weight.Ndx into EMS
    File$ = NdxDir$ + "WEIGHT.NDX"
    IF NOT Exist%(File$) THEN
        CLS
        PRINT File$; " not found."
        CALL ReleaseEMS
        STOP
    END IF

    WeightLen = LEN(WeightTemp)
    WeightEMS = LoadIntoEMS(File$, WeightNoEMSFlag)
    IF WeightNoEMSFlag THEN        ' use disk
        FOpenAll File$, 0, 4, WeightFile
    END IF

    ' -- Load Key.Ndx into EMS
    File$ = NdxDir$ + "KEY.NDX"
    IF NOT Exist%(File$) THEN
        CLS
        PRINT File$; " not found."
        CALL ReleaseEMS
        STOP
    END IF

    KeyLen = LEN(KeyTemp)
    KeyEMS = LoadIntoEMS(File$, KeyNoEMSFlag)
    IF KeyNoEMSFlag THEN
        FOpenAll File$, 0, 4, KeyFile
    END IF

    ' -- Load PolySemy.DAT into EMS
    File$ = NdxDir$ + "POLYSEMY.DAT"
    IF NOT Exist%(File$) THEN
        CLS
        PRINT File$; " not found."
        CALL ReleaseEMS
        STOP
    END IF

    DIM PolyVal AS PolyValueType
    PolyLen = LEN(PolyVal)

    PolyEMS = LoadIntoEMS(File$, PolyNoEMSFlag)
    IF PolyNoEMSFlag THEN
        FOpenAll File$, 0, 4, PolyFile
    END IF

    'NumDoc& = FileSize&(NdxDir$ + "KEY.NDX") / KeyLen
    'NumPages = NumDoc& * CLNG(ValLen) \ SixteenK + 2    ' round off to nearest 2 pages
    'EmsAlloc NumPages, NewValueEMS, "NewVal.Ndx"
    'FOpenAll File$, 0, 4, NewValueEMS

END SUB

FUNCTION LoadIntoEMS (File$, NoEMSFlag) STATIC

    ' Returns the handle where the file was loaded into

    NoEMSFlag = FALSE
    EmsPg = EmsGetPFSeg%
    SizeofFile& = FileSize&(File$)

    NumPages = SizeofFile& \ SixteenK + 2       ' round off to nearest 2 pages
    EmsAlloc NumPages, FileEMS, File$

    IF FileEMS = 0 THEN        ' not enough EMS, use disk instead
        NoEMSFlag = TRUE
        EXIT FUNCTION
    END IF

    Num32kBlocks = SizeofFile& \ ThirtyTwoK&
    LeftOver& = SizeofFile& - (Num32kBlocks * ThirtyTwoK&)

    FOpenAll File$, 0, 4, LoadFILE

    FOR i = 1 TO Num32kBlocks + 1

        Box0 11, 5, 15, 70, 2, RevAttr
        'PaintBox0 11, 5, 15, 70, RevAttr

        QPrintRC "Loading " + File$ + " block" + STR$(i) + " /" + STR$(Num32kBlocks + 1), 13, 12, RevAttr

        ' -- map pages of the EMS memory to the EMS upper mem page frame
        FOR j = 1 TO 2
            EmsMapMem FileEMS, j, (i - 1) * 2 + j
            IF EmsError% THEN PRINT "Ems error:"; EmsError%: STOP
        NEXT

        ' -- seek to beginning of current block
        FSeek LoadFILE, (i - 1) * ThirtyTwoK&
        IF DosError% THEN PRINT "Dos Error:"; WhichError%: STOP

        IF i < Num32kBlocks + 1 THEN

            ' -- get the 32k block and put it directly into the EMS page frame
            FGetA LoadFILE, BYVAL EmsPg, BYVAL 0, ThirtyTwoK&

        ELSE

            IF DosError% THEN PRINT "Dos Error:"; WhichError%: STOP

            ' -- load the left over (<32k) bytes
            FGetA LoadFILE, BYVAL EmsPg, BYVAL 0, LeftOver&
            IF DosError% THEN PRINT "Dos Error:"; WhichError%: STOP

        END IF

    NEXT

    FClose LoadFILE
    CLS

    LoadIntoEMS = FileEMS
END FUNCTION



SUB NewKey STATIC

    REDIM AvgRatio(1 TO 127)
    AvgV30! = 0

    DocNum& = FileSize&(NdxDir$ + "Key.ndx") \ LEN(KeyTemp)
    FOR RecNum& = 1 TO DocNum&

        QPrintRC "Document #" + STR$(RecNum&), 10, 30, -1

        ' read the document's keyword record from disk or from EMS
        IF KeyNoEMSFlag THEN
            FGetRT KeyFile, KeyTemp, RecNum&, KeyLen
        ELSE
            EmsGet KeyTemp, KeyLen, RecNum&, KeyEMS
        END IF

        IF WeightNoEMSFlag THEN
            FGetRT WeightFile, WeightTemp, RecNum&, WeightLen
        ELSE
            EmsGet WeightTemp, WeightLen, RecNum&, WeightEMS
        END IF

        IF KeyTemp.Num = 0 THEN KeyTemp.Num = 127

        REDIM TempArray(1 TO KeyTemp.Num) AS ValCodeType

        FOR i = 1 TO KeyTemp.Num

            IF PolyNoEMSFlag THEN
                FGetRT PolyFile, PolyValue, CLNG(KeyTemp.Code(i)), PolyLen
            ELSE
                EmsGet PolyValue, PolyLen, CLNG(KeyTemp.Code(i)), PolyEMS
            END IF

            TempArray(i).Value = WeightTemp.Weight(i) * PolyValue.Value ^ .125    ' (i.e. ^1/8)
            TempArray(i).Code = KeyTemp.Code(i)

        NEXT

        SortT TempArray(1), KeyTemp.Num, 1, LEN(TempArray(1)), 0, -3

        KeyTemp.Num = KeyTemp.Num
        FOR k = 1 TO KeyTemp.Num
            KeyTemp.Code(k) = TempArray(k).Code
            ValArray.Weight(k) = TempArray(k).Value
        NEXT

        ' clear the unused slots
        FOR f = k TO 127
            KeyTemp.Code(f) = 0
            ValArray.Weight(f) = 0
        NEXT

        ValArray.Mult = 1

        FPutRT NewValueFile, ValArray, RecNum&, ValLen
        FPutRT NewKeyFile, KeyTemp, RecNum&, KeyLen

        IF INKEY$ = CHR$(27) THEN EXIT SUB

    NEXT
    CLS

    PRINT "Everything is OK"

    FClose NewKeyFile
    FClose NewValueFile

END SUB

SUB ReleaseEMS STATIC

    IF KeyEMS THEN EmsRelMem KeyEMS
    IF WeightEMS THEN EmsRelMem WeightEMS
    IF PolyEMS THEN EmsRelMem PolyEMS
    IF KeyFile THEN FClose KeyFile
    IF WeightFile THEN FClose WeightFile
    IF PolyFile THEN FClose PolyFile

END SUB
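
The re-weighting that NEWKEY applies to each document's keyword record (weight times the eighth root of the keyword's polysemy value, then a sort by the new weight) can be sketched as below. The function name, the `poly` dictionary and the fixed 127 slots are illustrative assumptions, and the sort is taken to be descending, on the assumption that the direction flag `1` passed to `SortT` selects descending order.

```python
def reweight_keywords(codes, weights, poly, slots=127):
    """Return (codes, weights) re-sorted by weight * poly ** 0.125.

    poly maps a keyword code to its polysemy value. Unused slots are
    filled with zeros, as the listing does for entries past the last
    keyword of the record.
    """
    scored = [(w * poly[c] ** 0.125, c) for c, w in zip(codes, weights)]
    # Python's sort is stable, so equal weights keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    pad = slots - len(scored)
    new_codes = [c for _, c in scored] + [0] * pad
    new_weights = [v for v, _ in scored] + [0.0] * pad
    return new_codes, new_weights
```

Taking the eighth root compresses the polysemy values heavily, so they reorder keywords only when the original weights are already close.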

DEFINT A-Z
'CHANGEFL.BAS

'Invoked: changefl ConfigFile
'Creates: KyInvrts.Ndx, Rel-Invs.Ndx

' TYPE SmallNdxType
'    Index AS LONG
'    Num AS INTEGER
'Uses: KyInvert.Ndx, Rel-Inv.Ndx

' TYPE NdxType
'    Code AS INTEGER
'    Index AS LONG
'    Num AS INTEGER

'makes copies of KyInvert.Ndx & Rel-Inv.Ndx without the Code field

'$INCLUDE: '\USER\INCLUDE\TYPES.BI'
'$INCLUDE: '\USER\INCLUDE\DECLARES.BI'
IF COMMAND$ <> "" THEN
    ConfigFile$ = COMMAND$
    SpcLoc = INSTR(ConfigFile$, " ")
    IF SpcLoc THEN ConfigFile$ = LEFT$(ConfigFile$, SpcLoc - 1)
    ConfigFile$ = ConfigFile$ + ".CFG"
ELSE
    Chime 3
    PRINT "Usage CHANGEFL {dataname}."
    END
END IF

OPEN ConfigFile$ FOR INPUT AS #1
INPUT #1, Fg, Bg, Brdr, LstDir$, DocDir$, NdxDir$, AbstrDir$, Lang$
CLOSE #1
QPrintRC "CHANGEFL", 1, 37, -1
File$ = NdxDir$ + "REL-INV.NDX"
IF NOT Exist%(File$) THEN
    CLS
    PRINT File$; " was not found."
    END
END IF
NdxLEN = 8

FreqCompNum = FileSize&(File$) \ NdxLEN
REDIM TempFCIndx(1 TO FreqCompNum) AS NdxType
DIM FCIndx AS SmallNdxType

FGetAH File$, TempFCIndx(1), NdxLEN, FreqCompNum
CLS

OPEN NdxDir$ + "Rel-InvS.NDX" FOR RANDOM AS #1 LEN = 6

FOR i = 1 TO FreqCompNum
    FCIndx.Index = TempFCIndx(i).Index
    FCIndx.Num = TempFCIndx(i).Num
    LOCATE 10, 25
    PRINT i
    PUT #1, i, FCIndx
NEXT
ERASE TempFCIndx

CLOSE #1

' -- Load the Inverted Data Files

File$ = NdxDir$ + "KYINVERT.NDX"
IF NOT Exist%(File$) THEN
    CLS
    PRINT File$; " was not found."
    END
END IF
NumKeys = FileSize&(File$) \ NdxLEN
REDIM TempKYIndx(1 TO NumKeys) AS NdxType
FGetAH File$, SEG TempKYIndx(1), NdxLEN, NumKeys
DIM KYIndx AS SmallNdxType

OPEN NdxDir$ + "KYINVRTS.NDX" FOR RANDOM AS #1 LEN = 6
FOR i = 1 TO NumKeys
    KYIndx.Index = TempKYIndx(i).Index
    KYIndx.Num = TempKYIndx(i).Num
    LOCATE 10, 25
    PRINT i
    PUT #1, i, KYIndx
NEXT

ERASE TempKYIndx

CLS

CLOSE #1
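
The record conversion CHANGEFL performs, from 8-byte `NdxType` records to 6-byte `SmallNdxType` records, can be sketched with Python's `struct` module. The function name is hypothetical, and the layouts assume that QuickBASIC stores `INTEGER` as a 16-bit and `LONG` as a 32-bit little-endian signed value with no padding, which matches the record lengths 8 and 6 used in the listing.

```python
import struct

# Code AS INTEGER, Index AS LONG, Num AS INTEGER  -> 8 bytes
NDX = struct.Struct("<hlh")
# Index AS LONG, Num AS INTEGER                   -> 6 bytes
SMALL = struct.Struct("<lh")


def strip_code_field(raw):
    """Convert packed NdxType records into packed SmallNdxType records."""
    out = bytearray()
    for _code, index, num in NDX.iter_unpack(raw):
        # The record position already identifies the code, so drop it.
        out += SMALL.pack(index, num)
    return bytes(out)
```

Dropping the field saves a quarter of the index size, which matters when the whole index has to fit in a single array.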



DEFINT A-Z

'$INCLUDE: '\USER\INCLUDE\TYPES.BI'
'$INCLUDE: '\USER\INCLUDE\DECLARES.BI'

CONST MaxText = 2500, ASC0 = 48, ASC9 = 57, FALSE = 0, TRUE = NOT FALSE
CONST SixteenK = 16384, ThirtyTwoK& = 32768

TYPE WrdLen
    Wrd AS INTEGER
    Len AS INTEGER
END TYPE

DECLARE FUNCTION GetSentValue! (SentNum, NumWord, Value AS WeightAvgNdx127, KeyTemp AS KeyNdx127)

DECLARE FUNCTION InstrTbli (Start! , SourceS, Char*) ^ynaxicj 

DECLARE FUNCTION LoadIntoEMS% (File$)
DECLARE FUNCTION ExtractWord$ (Source$, Char$, Start%)
DECLARE SUB ChangeChar (Txt$, KeepStr$)
DECLARE SUB Config (Machine$, Beg&, Fin&)
DECLARE SUB Chopping (Best)
DECLARE SUB CutSentence (BestSent, Sent$, NumWord)

SkuM sIS tosEr^S?er(j KeyTeOP * KeyMX127 ' *"S""P « CcoouiatC) AS UordCode) 

DECLARE SUB ExtractDoc (DocNum&, TotSentNum%, Handle%, NoTextFlag)
DECLARE SUB ExtractFullText (TxtEMS%, FullEMS%)
DECLARE SUB ExtractSent (SentNum%, Sent$, LenSen%, Handle%)
DECLARE SUB ExtractWordNum (Source$, WordNum%, Start%, Slen%)
DECLARE SUB EmsAlloc (NumPages%, Handle%, LoadFILE$)
DECLARE SUB FillScrn0 (UL, LC, BL, RC, Colr, Char)
DECLARE SUB FindCombKey (WordList$(), NumWord%, NumKeyWordFound%, SentNum, NumKey, CombList() AS WordCode)
DECLARE SUB FindSingKey (WordList$(), NumWord%, NumKeyWordFound%, SentNum, SingList() AS WordCode)
DECLARE SUB GetKWList (DocNum&, SingList() AS WordCode, CombList() AS WordCode, KeyTemp AS KeyNdx127)
DECLARE SUB LoadData ()
DECLARE SUB Rank (BestSent, NoMoreFlag, NumKeyWords(), NumWord(), Value AS WeightAvgNdx127, KeyTemp AS KeyNdx127)
DECLARE SUB ReadEnglishText (FirstLine&, LastLine&, Handle%, TotSentNum%)
DECLARE SUB ReadGermanText (FirstLine&, LastLine&, Handle%, TotSentNum%)
DECLARE SUB ReleaseEMS ()
DECLARE SUB ReadSection (Txt$, SecArray$(), ArtArray$())
DECLARE SUB WordParse (Sent$, LenSen, SentNum%, WordList$(), Words%)
DECLARE SUB WriteSentence (Sent$, Best, AbstrPos&)



'--- COMMON SHARED Variables/Arrays

COMMON SHARED Fg, Bg, Brdr, NormAttr, RevAttr
COMMON SHARED LstDir$, NdxDir$, DocDir$, AbstrDir$, AtList$, Lang$
COMMON SHARED DictCodeNum%, DictCodeH%, KeyEMS, ValueEMS, NumComb, NumSing
COMMON SHARED DocNdxFile, DocFile, AbstrNdxFile, AbstrFile, TextNum&
COMMON SHARED KeyWordFoundNdx() AS WordNdxType, KeyWordFound() AS ListType
COMMON SHARED SentNdx() AS SentNdxType, TextArray() AS TextType, TotSentNum
COMMON SHARED KeepNoNumbers$, Keep$, Abbrev1$(), Noise$(), AbbrevEngl$()
COMMON SHARED Section$(), Article$(), Paragraph$(), Artikel$(), Numbers$()
COMMON SHARED MeanPrefixes$(), Prefixes$()

'--- MAIN MODULE ---

CALL Config(Machine$, Beg&, Fin&)
CALL LoadData

File$ = DocDir$ + ".NDX"
FOpenAll File$, 0, 2, DocNdxFile
File$ = DocDir$ + ".TXT"
FOpenAll File$, 0, 2, DocFile
File$ = AbstrDir$ + ".ND" + Machine$
FCreate File$
FOpenAll File$, 1, 4, AbstrNdxFile
File$ = AbstrDir$ + ".TX" + Machine$
FCreate File$
FOpenAll File$, 1, 4, AbstrFile

CountAll& = FileSize&(DocDir$ + ".NDX") \ 8 ' number of records (i.e., files)
'ON ERROR GOTO ErrorHandler

DIM Abstr AS ISAMtype
LenAbstr = LEN(Abstr)
AbstrPos& = 1
all# = TIMER

FOR DocNum& = Beg& TO Fin& 'CountAll& ' IN REAL PROGRAM 1 TO CountAll&
  NumKey = 0
  CALL ExtractDoc(DocNum&, TotSentNum, TxtEMS%, NoTextFlag)

es£S>° false: ^ - « «» - »«• «■ — *><• -.«-. sw «. «.« 

  REDIM NumKeyWords(1 TO TotSentNum)
  REDIM KeyWordFoundNdx(1 TO TotSentNum) AS WordNdxType
  REDIM KeyWordFound(1 TO 1) AS ListType
  REDIM SingList(1 TO 1) AS WordCode
  REDIM CombList(1 TO 1) AS WordCode
  DIM KeyTemp AS KeyNdx127
  CALL GetKWList(DocNum&, SingList(), CombList(), KeyTemp)

  UL = 6: LC = 25: BL = 9: RC = 55
  Box0 UL, LC, BL, RC, 2, RevAttr
  PaintBox0 UL, LC, BL, RC, RevAttr
  QPrintRC "Document " + STR$(DocNum&), UL + 1, LC + 9, RevAttr
  QPrintRC "Total Sentences " + STR$(TotSentNum), UL + 2, LC + 5, RevAttr
  UL = 12: LC = 30: BL = 17: RC = 50
  Box0 UL, LC, BL, RC, 2, RevAttr

  FOR SentNum = 1 TO TotSentNum 'How many sentences?
    CALL ExtractSent(SentNum, Sent$, LenSen, TxtEMS%)
    REDIM WordList$(1 TO 1)

• SSr: PES' ^*«<>, ^(SentNu,)) . 

    CALL FindCombKey(WordList$(), NumWord(SentNum), NumKeyWords(SentNum), SentNum, NumKey, CombList())

S« r - nt ^ IS"!™ I * STRS<Sent«ua), UL% 1. LC ♦ 5, RevAttr 
    QPrintRC "Words " + STR$(NumWord(SentNum)), UL + 2, LC + 5, RevAttr

    ERASE WordList$
    END IF
  NEXT

  'erase text without punctuation and extract the full text
  CALL ExtractFullText(TxtEMS%, FullEMS%)

  '--- create abstract and write it onto the disk
  CLS
  UL = 4: LC = 25: BL = 7: RC = 55
  Box0 UL, LC, BL, RC, 2, RevAttr
  PaintBox0 UL, LC, BL, RC, RevAttr
  QPrintRC "Document " + STR$(DocNum&), UL + 1, LC + 9, RevAttr
  QPrintRC "Total Sentences" + STR$(TotSentNum), UL + 2, LC + 5, RevAttr

  LENTxt = 80
  Abstr.First = AbstrPos&
  NoMoreFlag = FALSE

  DIM WordValue AS WeightAvgNdx127
  'EMSGet WordValue, LEN(WordValue), DocNum&, ValueEMS
  FGetRT ValueEMS, WordValue, DocNum&, LEN(WordValue)
  OPEN "Origin" FOR OUTPUT AS #8
  OPEN "New" FOR OUTPUT AS #9

  DO
    REDIM KeyWords$(1 TO 1)
    CALL Rank(BestSent, NoMoreFlag, NumKeyWords(), NumWord(), WordValue, KeyTemp)
    IF NoMoreFlag THEN EXIT DO
    CALL ExtractSent(BestSent, Sent$, LenSen, FullEMS%)
    PRINT #8, Sent$
    CALL CutSentence(BestSent, Sent$, NumWord)
    CALL WriteSentence(Sent$, BestSent, AbstrPos&)
    CALL Chopping(BestSent)
  LOOP UNTIL NoMoreFlag
  CLOSE #8, #9

  Abstr.Last = AbstrPos& - 1
  FPutRT AbstrNdxFile, Abstr, DocNum&, 8



  IF TextNum& > MaxText THEN
    CALL EmsRelMem(FullEMS%)
  ELSE
    ERASE TextArray
    FreSpace& = FRE("")
  END IF

  cht$ = INKEY$
  IF LEFT$(cht$, 1) = CHR$(27) THEN
    GOTO FinishWork
  END IF

NextDoc:
  CALL FFlush(AbstrFile)
  CALL FFlush(DocNdxFile)
  CALL FFlush(DocNdxFile)
NEXT 'document

FinishWork:
all# = TIMER - all#
H = all# \ 3600
M = (all# MOD 3600) \ 60
S = ((all# MOD 3600) MOD 60)

QPrintRC "Total : - ♦ STRSCH) ♦ " H ■ ♦ STRS(H) ♦ - a - ♦ STRSCa) ♦ ■ »• -to » a-^e-,- 

QPrintRC "Total : " + STR$(all#), 20, 25, NormAttr

FClose AbstrNdxFile
FClose AbstrFile
FClose DocNdxFile
FClose DocFile
ReleaseEMS
END 'program

'--- END OF MAIN MODULE ---

'--- DATA ---

GermanAbbreviature:

DATA "2B". ,, ogt-,"1nkl". ,, .gl","grunds". , *ausschL","einschL* 
DATA •lCl","Beltl", , 'Nr• 4 "Ge^- # "Be^Se^", , Tf , •,•subJ" <> "obj , • 
EnglAbbreviature:
DATA "mr", "mrs", "messrs", "sen", "rep", "ms", "dr", "drs"
EnglSections:
DATA "Section", "Sec", "Sec."
DATA "Article", "Art", "Art."
GermanSections:

DATA "Paragraph", "Par", "Par.", "Para."
DATA "Artikel", "Art", "Art."
NumData:
DATA "i", "ii", "iii", "iv", "v", "vi", "vii", "viii", "ix", "x"
DATA "xi", "xii", "xiii", "xiv", "xv", "xvi", "xvii", "xviii", "xix", "xx"
DATA "xxi", "xxii", "xxiii", "xxiv", "xxv", "xxvi", "xxvii", "xxviii", "xxix", "xxx"

'$INCLUDE: '\user\include\prefixes.bi'

SUB ChangeChar (Txt$, KeepStr$) STATIC
'--- replace all chars except those contained in KeepStr$ with a space
New$ = " "
FOR i = 1 TO LEN(Txt$)
  IF INSTR(KeepStr$, MID$(Txt$, i, 1)) = 0 THEN
    MID$(Txt$, i, 1) = New$
  END IF
NEXT
END SUB


SUB Chopping (Best) STATIC
' delete kw which is found already from the kwlist
NumCodes = KeyWordFoundNdx(Best).Last - KeyWordFoundNdx(Best).First + 1
REDIM CodeDelete(1 TO NumCodes) 'array of the codes which should be deleted
j = 0
FOR i = KeyWordFoundNdx(Best).First TO KeyWordFoundNdx(Best).Last
  IF KeyWordFound(i).Code <> 0 THEN
    j = j + 1
    CodeDelete(j) = KeyWordFound(i).Code
  ELSE
    NumCodes = NumCodes - 1
  END IF
NEXT
FOR i = 1 TO UBOUND(KeyWordFound)
  CALL Search(SEG CodeDelete(1), NumCodes, KeyWordFound(i).Code, Found, 0, 0, -1)
  IF Found <> -1 THEN 'delete this word
    KeyWordFound(i).Code = 0
    KeyWordFound(i).Num = 0
  END IF
NEXT
END SUB
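
The effect of Chopping can be sketched outside BASIC. The Python below is an illustrative sketch only, not part of the listing: the record layout (dicts with `code` and `num`) and the 0-based inclusive `first`/`last` bounds are assumptions standing in for KeyWordFound() and KeyWordFoundNdx(Best). It collects the non-zero codes found in the chosen sentence, then zeroes every entry in the whole list that carries one of those codes, so later sentences get no credit for keywords already used.

```python
def chop(found, first, last):
    """Zero out every keyword entry whose code occurs in found[first..last].

    found -- list of dicts with 'code' and 'num' (hypothetical layout)
    first, last -- inclusive 0-based bounds of the chosen sentence's entries
    """
    # Codes still alive in the chosen sentence (0 marks an already-deleted entry).
    codes = {entry["code"] for entry in found[first:last + 1] if entry["code"] != 0}
    # Delete those codes everywhere, not only inside the chosen sentence.
    for entry in found:
        if entry["code"] in codes:
            entry["code"] = 0
            entry["num"] = 0
    return found


if __name__ == "__main__":
    sample = [{"code": 5, "num": 1}, {"code": 7, "num": 2},
              {"code": 5, "num": 9}, {"code": 8, "num": 10}]
    print([e["code"] for e in chop(sample, 0, 1)])
```

A set lookup is used here in place of the listing's Search call; the result is the same under the assumptions above.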

SUB Config (Machine$, Beg&, Fin&) STATIC
Cmd$ = RTRIM$(LTRIM$(COMMAND$))
Params = InCount(Cmd$, " ") + 1

IF Params = 4 THEN
  ' Expected information on command line:
  ' Config file, First Doc, Last Doc
  Extract Cmd$, " ", 1, Strt, Slen '--- extract first param
  DBName$ = MID$(Cmd$, Strt, Slen)
  ConfigFile$ = DBName$ + ".CFG"
  Extract Cmd$, " ", 2, Strt, Slen '--- extract second param
  Machine$ = MID$(Cmd$, Strt, Slen)
  Extract Cmd$, " ", 3, Strt, Slen '--- extract third param
  Beg& = QPValL&(MID$(Cmd$, Strt, Slen))
  Extract Cmd$, " ", 4, Strt, Slen '--- extract fourth param
  Fin& = QPValL&(MID$(Cmd$, Strt, Slen))
ELSE
  PRINT
  PRINT "ABSTRACT Program Error: Missing Parameters"
  PRINT
  PRINT
  PRINT "Required Parameters are:"
  PRINT
  PRINT "Abstr Config File First Doc Last Doc"
  PRINT
  Chime 10
  PRINT "Press the SPACE BAR to exit:"
  i$ = INPUT$(1)
  END
END IF

OPEN ConfigFile$ FOR INPUT ACCESS READ SHARED AS #1
INPUT #1, Fg, Bg, Brdr, LstDir$, DocDir$, NdxDir$, AbstrDir$, Lang$
CLOSE #1

OPEN LstDir$ + "@" + Lang$ + ".LST" FOR INPUT AS #1
LINE INPUT #1, AtList$
CLOSE #1

NormAttr = OneColor%(Fg, Bg)
RevAttr = OneColor%(Bg, Fg AND 7)
CLS

' replacement string for punctuation
Keep$ = " "
KeepNoNumbers$ = " "
FOR j = 65 TO 90: Keep$ = Keep$ + CHR$(j): NEXT
FOR j = 97 TO 122: Keep$ = Keep$ + CHR$(j): NEXT
KeepS = KeepS ♦ *'& m 
KeepNoNumbers$ = Keep$
FOR j = 43 TO 57: Keep$ = Keep$ + CHR$(j): NEXT

RESTORE GermanAbbreviature
REDIM Abbrev1$(1 TO 15)
FOR i = 1 TO 15
  READ Abbrev1$(i)
NEXT

i = 1
REDIM Noise$(1 TO 1)
IF Lang$ = "GERMAN" THEN
  OPEN LstDir$ + "NOISE.DAT" FOR INPUT ACCESS READ SHARED AS #1
  DO UNTIL EOF(1)
    REDIM PRESERVE Noise$(1 TO i)
    INPUT #1, Noise$(i)
    n$ = LEFT$(Noise$(i), LEN(Noise$(i)) - 1)
    Noise$(i) = UCASE$(LEFT$(n$, 1)) + MID$(n$, 2)
    i = i + 1
  LOOP
  CLOSE #1
END IF

IF NOT EmsLoaded% THEN
  Chime 8
  PRINT "The EMS Driver has not been Loaded."
  STOP
END IF

DIM AbbrevEngl$(1 TO 8)
FOR i = 1 TO 8
  READ AbbrevEngl$(i)
NEXT

DIM Section$(1 TO 3)
FOR i = 1 TO 3
  READ Section$(i)
NEXT

DIM Article$(1 TO 3)
FOR i = 1 TO 3
  READ Article$(i)
NEXT

DIM Paragraph$(1 TO 4)
FOR i = 1 TO 4
  READ Paragraph$(i)
NEXT

DIM Artikel$(1 TO 3)
FOR i = 1 TO 3
  READ Artikel$(i)
NEXT

DIM Numbers$(1 TO 30)
FOR i = 1 TO 30
  READ Numbers$(i)
NEXT

IF Lang$ = "GERMAN" THEN
  RESTORE GermanPrefixes
ELSE
  RESTORE EnglishPrefixes
END IF

REDIM Prefixes$(2 TO 9)
IF Lang$ = "GERMAN" THEN
  FOR i = 2 TO 9
    READ FirstHalf$, SecondHalf$
    Prefixes$(i) = FirstHalf$ + SecondHalf$
  NEXT
ELSE
  FOR i = 2 TO 9
    READ Prefixes$(i)
  NEXT
END IF

REDIM MeanPrefixes$(3 TO 14)
IF Lang$ = "GERMAN" THEN
  FOR i = 3 TO 14
    READ FirstHalf$, SecondHalf$, ThirdHalf$
    MeanPrefixes$(i) = FirstHalf$ + SecondHalf$ + ThirdHalf$
  NEXT
ELSE
  FOR i = 3 TO 14
    READ MeanPrefixes$(i)
  NEXT
END IF
END SUB

SUB CutSentence (BestSent, Sent$, NumWord) STATIC
'this procedure leaves in the sentence only keywords with 30 characters of
'both sides but always full word and changes the rest with "..."
' 1. collect the keywords into array
' 2. Find this kw's in the sentence
' 3. Take 40 characters (but full word) of both sides and if there is
'    something else then cut it

n = 0: OldStr$ = ""
'move keywords into array
NumCodes = KeyWordFoundNdx(BestSent).Last - KeyWordFoundNdx(BestSent).First + 1
REDIM KeyWords$(1 TO NumCodes)
FOR i = KeyWordFoundNdx(BestSent).First TO KeyWordFoundNdx(BestSent).Last
  IF KeyWordFound(i).Num <> 0 THEN
    IF INSTR(OldStr$, RTRIM$(KeyWordFound(i).Str) + " ") = 0 THEN
      n = n + 1
      KeyWords$(n) = RTRIM$(KeyWordFound(i).Str)
      OldStr$ = OldStr$ + KeyWords$(n) + " "
    END IF
  END IF
NEXT
REDIM PRESERVE KeyWords$(1 TO n)

'for each kw find position
SentClean$ = Sent$
CALL ChangeChar(SentClean$, KeepNoNumbers$)
IF Lang$ = "GERMAN" THEN CALL Lower(SentClean$)
NumKey = 0
REDIM KeyStart(1 TO NumCodes) AS WrdLen
FirstTime = TRUE
FOR i = 1 TO n
  Start = 1
  DO
    NumKey = NumKey + 1
    'check the first word with capital letter before
    IF Lang$ = "GERMAN" THEN
      KeyStart(NumKey).Wrd = INSTR(Start, SentClean$, KeyWords$(i) + " ")
    ELSE

IF FirstTioe THEN 

^ ^ KeyStarttNuoKey).Vrd * INSTR<$entCleanS, RTRl-SCUCASES(UFTSOceyUortJsS(1>. 1)1 ♦ HIDSCKeyttord 

' IF KeyStart(Nu«Key).Urd = 0 THEN 

KeyStort<NumKer>.«rd * lHSTR<Stert, SentClearS. KeyWordsS(l) + • •) 

END IF 

      'If still not found try add "'s" to the end of the first word
      IF KeyStart(NumKey).Wrd = 0 THEN
        SpLoc = INSTR(KeyWords$(i), " ")
        IF SpLoc THEN
          KW$ = LEFT$(KeyWords$(i), SpLoc - 1) + "'s" + MID$(KeyWords$(i), SpLoc)
        ELSE
          KW$ = KeyWords$(i) + "'s"
        END IF
        KeyStart(NumKey).Wrd = INSTR(Start, SentClean$, KW$)
      END IF
      FirstTime = FALSE
    END IF

    KeyStart(NumKey).Len = LEN(KeyWords$(i))
    IF KeyStart(NumKey).Wrd <> 0 THEN
      MID$(SentClean$, KeyStart(NumKey).Wrd, KeyStart(NumKey).Len) = STRING$(KeyStart(NumKey).Len, 32)
      Start = KeyStart(NumKey).Wrd + KeyStart(NumKey).Len
    END IF
  LOOP UNTIL KeyStart(NumKey).Wrd = 0 OR NumKey = NumCodes OR Start >= LEN(Sent$)
  IF KeyStart(NumKey).Wrd = 0 THEN NumKey = NumKey - 1
  IF NumKey = NumCodes THEN EXIT FOR
NEXT

CALL SortT(SEG KeyStart(1), NumKey, 0, LEN(KeyStart(1)), 0, -1)
j = 0: Shift = 0

'cut the first part
IF KeyStart(1).Wrd > 40 THEN
  DelPoint = QInstrB(KeyStart(1).Wrd - 40, Sent$, " ")
  IF DelPoint <> 0 THEN
    Sent$ = "..." + MID$(Sent$, DelPoint)
    Shift = DelPoint - 3
  END IF
END IF

'cut the middle parts
FOR i = 1 TO NumKey - 1
  IF KeyStart(i + 1).Wrd - KeyStart(i).Wrd + KeyStart(i).Len > 80 THEN
    DelPointStart = INSTR(KeyStart(i).Wrd + KeyStart(i).Len + 40 - Shift, Sent$, " ")
    DelPointEnd = QInstrB(KeyStart(i + 1).Wrd - Shift - 40, Sent$, " ")
    IF DelPointStart <> 0 AND DelPointEnd > DelPointStart THEN
      Sent$ = LEFT$(Sent$, DelPointStart) + "..." + MID$(Sent$, DelPointEnd)
      Shift = Shift + DelPointEnd - DelPointStart - 3
    END IF
  END IF
NEXT

'cut the last part
IF KeyStart(NumKey).Wrd <> 0 THEN 'over insurance
  IF LEN(Sent$) - (KeyStart(NumKey).Wrd + KeyStart(NumKey).Len - Shift) > 40 THEN
    DelPoint = INSTR(KeyStart(NumKey).Wrd + KeyStart(NumKey).Len - Shift + 40, Sent$, " ")
    IF DelPoint <> 0 AND DelPoint < LEN(Sent$) THEN
      Sent$ = LEFT$(Sent$, DelPoint) + "..."
    END IF
  END IF
END IF
END SUB


SUB DictSortSearch (KeyTemp AS KeyNdx127, KeepNum, SingList() AS WordCode, CombList() AS WordCode) STATIC
'*** Binary searches thru an array of DictType (word & code#) for a
'*** word and returns the code #. The EMS Handle and the number of
'*** Dict Code entries are Common Shared.

DIM DictTemp AS DictType, DictTemp2 AS DictType
LENDict = LEN(DictTemp)
FOR i = 1 TO KeepNum
  Code = KeyTemp.Code(i)
  L = 1: R = DictCodeNum% ' total number of code entries
  DO
    x = (CLNG(L) + R) \ 2
    EmsGet1El DictTemp, LENDict, x, DictCodeH%
    IF Code < DictTemp.Code THEN
      R = x - 1
    ELSE
      IF Code <> DictTemp.Code OR Lang$ <> "ENGLISH" THEN
        L = x + 1
      ELSE
        ' save it temporarily in case we need to restore it
        ' if the forward/backward look doesn't bring any
        ' positive results (i.e., we didn't get a match, and we
        ' need the original DictTemp for the LOOP test)
        SWAP DictTemp2, DictTemp
        EmsGet1El DictTemp, LENDict, x + 1, DictCodeH%
        IF Code = DictTemp.Code THEN
          x = x + 1
          EXIT DO
        ELSE
          EmsGet1El DictTemp, LENDict, x - 1, DictCodeH%
          IF Code = DictTemp.Code THEN
            x = x - 1
            EXIT DO
          END IF
        END IF
        SWAP DictTemp, DictTemp2
        L = L + 1
      END IF
    END IF
  LOOP UNTIL Code = DictTemp.Code OR L > R

  ' store previous words with the same code
  n = x
  DO
    EmsGet1El DictTemp, LENDict, n, DictCodeH%
    IF DictTemp.Code = Code THEN
      IF INSTR(QPRTrim$(DictTemp.Str), " ") THEN
        NumComb = NumComb + 1
        REDIM PRESERVE CombList(1 TO NumComb) AS WordCode
        CombList(NumComb).Code = DictTemp.Code
        CombList(NumComb).Str = DictTemp.Str
      ELSE
        NumSing = NumSing + 1
        REDIM PRESERVE SingList(1 TO NumSing) AS WordCode
        SingList(NumSing).Code = DictTemp.Code
        SingList(NumSing).Str = DictTemp.Str
      END IF
    END IF
    n = n - 1
  LOOP UNTIL DictTemp.Code <> Code OR n = 0

  ' store following words with the same code
  n = x + 1
  DO
    EmsGet1El DictTemp, LENDict, n, DictCodeH%
    IF DictTemp.Code = Code THEN
      IF INSTR(QPRTrim$(DictTemp.Str), " ") THEN
        NumComb = NumComb + 1
        REDIM PRESERVE CombList(1 TO NumComb) AS WordCode
        CombList(NumComb).Code = DictTemp.Code
        CombList(NumComb).Str = DictTemp.Str
      ELSE
        NumSing = NumSing + 1
        REDIM PRESERVE SingList(1 TO NumSing) AS WordCode
        SingList(NumSing).Code = DictTemp.Code
        SingList(NumSing).Str = DictTemp.Str
      END IF
    END IF
    n = n + 1
  LOOP UNTIL DictTemp.Code <> Code OR n > DictCodeNum
NEXT

SortT SingList(1), NumSing, 0, 66, 2, 64
SortT CombList(1), NumComb, 0, 66, 2, 64
END SUB

SUB DosErrHandler STATIC
IF DOSError% THEN
  Chime 5
  PRINT ErrorMsg$(WhichError%); " occurred while writing Abstract Index Doc"; DocNum&
  FClose AbstrNdxFile
  FClose AbstrFile
  FClose DocNdxFile
  FClose DocFile
  ReleaseEMS
  END
END IF
END SUB

SUB EmsAlloc (NumPages%, Handle%, LoadFILE$) STATIC
CALL EmsAllocMem(NumPages%, Handle%)
IF EmsError% THEN
  PRINT "Couldn't allocate"; CLNG(NumPages) * SixteenK; "bytes of EMS for "; LoadFILE$
  CALL ReleaseEMS
  CALL Chime(2)
  END
END IF
END SUB

SUB ExtractDoc (DocNum&, TotSentNum%, Handle%, NoTextFlag) STATIC
DIM Doc AS ISAMtype
CLS
Box0 11, 25, 15, 55, 2, RevAttr
PaintBox0 11, 25, 15, 55, RevAttr
QPrintRC "Extracting document #" + STR$(DocNum&), 13, 28, RevAttr
Handle = 0
NoTextFlag = FALSE

' EXTRACT document from file
FGetRT DocNdxFile, Doc, DocNum&, 8
TextNum& = Doc.Last& - Doc.First& + 1 '# lines of text in the file
IF TextNum& > MaxText THEN
  QPrintRC 2, 2, -1
  ' Allocate EMS to hold the Text file
  NumPages = 80& * TextNum& / SixteenK + 1 ' 80 bytes per line,
  CALL EmsAllocMem(NumPages, Handle%) ' 16K per EMS page
  IF EmsError% THEN PRINT "Couldn't allocate"; NumPages * SixteenK; "bytes of EMS.": STOP
ELSE
  IF TextNum& <= 0 THEN NoTextFlag = TRUE: EXIT SUB
  REDIM TextArray(1 TO TextNum&) AS TextType
END IF

' Read Document into EMS eliminating blank lines
' change non-alpha chars into spaces
IF Lang$ = "ENGLISH" THEN
  CALL ReadEnglishText(Doc.First&, Doc.Last&, Handle%, TotSentNum%)
ELSEIF Lang$ = "GERMAN" THEN
  CALL ReadGermanText(Doc.First&, Doc.Last&, Handle%, TotSentNum%)
END IF
CLS
END SUB

SUB ExtractFullText (TxtEMS, FullEMS) STATIC
'--- EXTRACT full document from file (should be moved into subroutine)
DIM Doc AS ISAMtype

IF TextNum& > MaxText THEN
  CALL EmsRelMem(TxtEMS%)
  '--- EXTRACT full document from file
  '--- Allocate EMS to hold the Text file
  NumPages = 80& * TextNum& / SixteenK + 1 ' 80 bytes per line,
  CALL EmsAllocMem(NumPages, FullEMS%) ' 16K per EMS page
  IF EmsError% THEN PRINT "Couldn't allocate"; NumPages * SixteenK; "bytes of EMS.": STOP

  NumLines = 0
  DIM Temp1 AS STR80
  LenTemp = LEN(Temp1)
  FOR i& = Doc.First& TO Doc.Last&
    FGetRT DocFile, Temp1, i&, 80
    NumLines = NumLines + 1
    EmsSet1El Temp1, LenTemp, NumLines, FullEMS%
  NEXT

  ERASE TextArray
  REDIM TextArray(1 TO TextNum&) AS TextType
  Num = TextNum&
  Ems2Array TextArray(1), 128, Num, TxtEMS
  CALL EmsRelMem(TxtEMS%)
END IF
END SUB

SUB ExtractSent (SentNum%, Sent$, LenSen%, Handle%) STATIC
DIM Temp1 AS STR80
LenTemp = LEN(Temp1)

'--- Get one sentence from the document
i = SentNdx(SentNum).BL
IF TextNum& > MaxText THEN
  EmsGet1El Temp1, LenTemp, i, Handle%
ELSE
  Temp1.Str = TextArray(i).Str
END IF

Sent$ = RTRIM$(LTRIM$(RIGHT$(Temp1.Str, 80 - SentNdx(SentNum).BC + 1))) + " "
IF SentNdx(SentNum).EL > SentNdx(SentNum).BL THEN
  FOR i = SentNdx(SentNum).BL + 1 TO SentNdx(SentNum).EL
    IF TextNum& > MaxText THEN
      EmsGet1El Temp1, LenTemp, i, Handle%
    ELSE
      Temp1.Str = TextArray(i).Str
    END IF
    Sent$ = Sent$ + LTRIM$(RTRIM$((Temp1.Str))) + " "
  NEXT
  L = LEN(RTRIM$(Temp1.Str)) - SentNdx(SentNum).EC
  Sent$ = LEFT$(Sent$, LEN(Sent$) - L)
ELSE
  Sent$ = LEFT$(Sent$, (SentNdx(SentNum).EC - SentNdx(SentNum).BC + 1))
END IF
LenSen = LEN(Sent$)
END SUB

FUNCTION ExtractWord$ (Source$, Char$, Start) STATIC
'extract word from Source$ from Start to Char$
Source$ = QPTrim$(Source$)
LenStr = LEN(Source$)
Slen = 0
FOR i = Start TO LenStr
  IF MID$(Source$, i, 1) = Char$ AND Slen > 0 THEN
    EXIT FOR
  ELSE
    Slen = Slen + 1
  END IF
NEXT
ExtractWord$ = LTRIM$(MID$(Source$, Start, Slen))
END FUNCTION

SUB ExtractWordNum (Source$, WordNum, Start, Slen) STATIC
SpLoc = 0: Count = 0
DO
  Start = SpLoc + 1
  SpLoc = INSTR(Start, Source$, " ")
  IF SpLoc <> Start THEN
    Count = Count + 1
  END IF
LOOP UNTIL Count = WordNum
Slen = INSTR(Start, Source$, " ") - Start
END SUB

SUB FindCombKey (WordList$(), NumWord%, NumKeyWordFound%, SentNum, NumKey, CombList() AS WordCode) STATIC
IF SentNum = 1 THEN
  KeyWordFoundNdx(SentNum).First = 1
ELSE
  ii = 1
  DO WHILE KeyWordFoundNdx(SentNum - ii).Last = 0 AND ii < SentNum - 1
    ii = ii + 1
  LOOP
  KeyWordFoundNdx(SentNum).First = KeyWordFoundNdx(SentNum - ii).Last + 1
END IF
NumKeyWordFound = 0
IF NumComb = 0 THEN EXIT SUB

FOR i = 1 TO NumWord ' number of words to process
  ' make lower case since combined keywords ignore case
  Word$ = LCASE$(WordList$(i))
  IF Word$ = "" GOTO SkipCombKey
  IF (ASC(Word$) >= 48 AND ASC(Word$) <= 57) GOTO SkipCombKey

  '*** Binary searches thru CombList for a range for first letter
  Last = 0
  w$ = LCASE$(LEFT$(Word$, 1))
  L = 1: R = NumComb% ' total number of code entries
  DO
    x = (CLNG(L) + R) \ 2
    IF w$ < LCASE$(LEFT$(CombList(x).Str, 1)) THEN
      R = x - 1
    ELSE
      IF w$ > LCASE$(LEFT$(CombList(x).Str, 1)) THEN
        L = x + 1
      ELSE 'if equal, check every other word for first and last
        N = 2
        IF x > 2 THEN
          DO
            IF LCASE$(LEFT$(CombList(x - N).Str, 1)) < w$ THEN
              IF LCASE$(LEFT$(CombList(x - N + 1).Str, 1)) < w$ THEN
                First = x - N + 2
              ELSE
                First = x - N + 1
              END IF
              EXIT DO
            ELSE
              N = N + 2
              IF x - N <= 0 THEN First = 1: EXIT DO
            END IF
          LOOP
        ELSE
          First = 1
        END IF

        IF x <= NumComb - 2 THEN
          N = 2
          DO
            IF LCASE$(LEFT$(CombList(x + N).Str, 1)) > w$ THEN
              IF LCASE$(LEFT$(CombList(x + N - 1).Str, 1)) > w$ THEN
                Last = x + N - 2
              ELSE
                Last = x + N - 1
              END IF
              EXIT DO
            ELSE
              N = N + 2
              IF x + N > NumComb THEN Last = NumComb: EXIT DO
            END IF
          LOOP
        ELSE
          Last = NumComb
        END IF
      END IF
    END IF
  LOOP UNTIL w$ = LCASE$(LEFT$(CombList(x).Str, 1)) OR L > R
  IF Last THEN
    ' if it's a valid range, then do comparisons for words in the range
    FOR j = Last TO First STEP -1
      Words = InCount(QPRTrim$(CombList(j).Str), " ") + 1 'count number of words

      ' the keyword has more words than are left in the word list
      ' skip it, because there's no possibility of a match.
      IF Words > NumWord - i + 1 GOTO SkipCombKey

      CurrKey$ = ExtractWord$(CombList(j).Str, " ", 1) 'extract first word
      Slen = LEN(CurrKey$) 'of combined keyword
      Start = Slen + 1
      IF RIGHT$(CurrKey$, 1) = "/" THEN
        Exact = TRUE
        CurrKey$ = LEFT$(CurrKey$, Slen - 1)
        Slen = Slen - 1
      ELSE
        Exact = FALSE
      END IF

      ' compare first word of combined key [CurrKey$]
      ' against the current document word [Word$]
      IF RIGHT$(Word$, 1) = "/" THEN Word$ = LEFT$(Word$, LEN(Word$) - 1)
      IF Exact THEN ' check for "exact" match
        Match = (LCASE$(CurrKey$) = LCASE$(Word$))
      ELSE
        Match = (LCASE$(CurrKey$) = LCASE$(LEFT$(Word$, Slen)))
      END IF

      ' no match, skip to next combined key in the First-Last range
      IF NOT Match GOTO SkipCombKey

      ' continue matching the rest of the words in the combined key
      ' exiting out as soon as there's a non-match
      AtFlag = FALSE
      NotFlag = FALSE

      FOR k = 1 TO Words - 1 ' # of words left in combined key
        ' extract the next word from the current combined keyword (j)
        CurrKey$ = ExtractWord$(CombList(j).Str, " ", Start)
        Slen = LEN(CurrKey$)
        Start = Start + Slen + 1
        IF RIGHT$(CurrKey$, 1) = "/" THEN
          Exact = TRUE
          CurrKey$ = LEFT$(CurrKey$, Slen - 1)
          Slen = Slen - 1
        ELSE
          Exact = FALSE
        END IF

        IF AtFlag = FALSE AND NotFlag = FALSE THEN
          DocWord$ = WordList$(i + k) ' document word to compare
        ELSE
          IF AtFlag = FALSE AND NotFlag = TRUE THEN
            IF k < Words - 1 THEN DocWord$ = WordList$(i + k + 1) 'next word
          ELSE
            IF AtFlag = TRUE AND NotFlag = FALSE THEN
              DocWord$ = WordList$(i + k - 1) 'previous word
            ELSE
              DocWord$ = WordList$(i + k)
            END IF
          END IF
        END IF

        IF RIGHT$(DocWord$, 1) = "/" THEN DocWord$ = LEFT$(DocWord$, LEN(DocWord$) - 1)

        IF CurrKey$ = "@" THEN ' special processing for @ wildcard
          IF INSTR(AtList$, "/" + DocWord$ + "/") THEN
            Match = TRUE ' the word was in the @ list, so continue
          ELSE
            IF Lang$ = "GERMAN" THEN
              Match = TRUE
              AtFlag = TRUE
            ELSE
              Match = FALSE
            END IF
          END IF
          IF Match THEN
            IF k < Words - 1 THEN
              DocWord$ = WordList$(i + k + 1)
              IF DocWord$ = "not" OR DocWord$ = "be" OR DocWord$ = "nicht" THEN NotFlag = TRUE
            END IF
          END IF
        ELSE
          IF Exact THEN ' check for "exact" match
            Match = (LCASE$(CurrKey$) = LCASE$(DocWord$))
          ELSE ' wildcard match, only compare # of chars in CurrKey$
            Match = (LCASE$(CurrKey$) = LCASE$(LEFT$(DocWord$, Slen)))
          END IF
        END IF

        IF NOT Match THEN EXIT FOR
      NEXT ' word in current combined keyword

      IF Match THEN ' this is a combined keyword, so add it to the list
        NumKeyWordFound = NumKeyWordFound + 1
        NumKey = NumKey + 1
        REDIM PRESERVE KeyWordFound(1 TO NumKey) AS ListType
        KeyWordFound(NumKey).Code = CombList(j).Code
        DO
          SlashLoc = INSTR(CombStr$, "/")
          IF SlashLoc THEN MID$(CombStr$, SlashLoc, 1) = " "
        LOOP UNTIL SlashLoc = 0
        KeyWordFound(NumKey).Num = i

        ' blank out combined word from list so that single keys
        ' are not generated from parts of combined keys found
        TempStr$ = ""
        FOR k = i TO i + Words - 1
          DO
            IF RIGHT$(WordList$(k), 1) = "/" THEN
              WordList$(k) = LEFT$(WordList$(k), LEN(WordList$(k)) - 1)
            ELSE
              EXIT DO
            END IF
          LOOP
          TempStr$ = TempStr$ + WordList$(k) + " "
          WordList$(k) = ""
        NEXT
        KeyWordFound(NumKey).Str = RTRIM$(TempStr$)
        EXIT FOR
      END IF
SkipCombKey:
    NEXT
  END IF
NEXT ' key in list
END SUB
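
The first-letter range lookup that FindCombKey and FindSingKey perform before comparing whole keywords can be illustrated compactly. The Python below is a hedged sketch, not the listing's method: it swaps in the standard `bisect` module for the hand-written binary search with its stride-2 probing, and the function name and 1-based return convention are assumptions chosen to mirror First and Last. It presumes the keyword list is already sorted case-insensitively, as the SortT calls in DictSortSearch suggest.

```python
from bisect import bisect_left, bisect_right


def first_letter_range(sorted_keys, word):
    """Return 1-based (first, last) of keys sharing word's first letter, or None.

    sorted_keys -- keyword strings sorted case-insensitively (assumed)
    word        -- document word being looked up
    """
    letters = [key[:1].lower() for key in sorted_keys]
    w = word[:1].lower()
    lo = bisect_left(letters, w)    # index of first key starting with w
    hi = bisect_right(letters, w)   # index one past the last such key
    if lo == hi:
        return None                 # no keyword starts with this letter
    return lo + 1, hi


if __name__ == "__main__":
    keys = ["apple", "art", "bat", "Cat", "cow"]
    print(first_letter_range(keys, "Crow"))
    print(first_letter_range(keys, "dog"))
```

Only the keys inside the returned range need the full exact or prefix comparison, which is the point of the narrowing step.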

SUB FindSingKey (WordList$(), NumWord%, NumKeyWordFound%, SentNum, NumKey, SingList() AS WordCode) STATIC
'ARRAY NAME       LEN  DESCRIPTION            DIRECTION   MODIFIED?
'SingList()       VAR  Single Keyword List    (Shared)    (Unchanged)
'KeyWordFound()   VAR  Single Keywords Found  (Returned)  (Changed)
'WordList$()      VAR  Document Words         (Passed)    (Unchanged)

IF NumSing = 0 THEN EXIT SUB

FOR i = 1 TO NumWord ' number of words in document
  Word$ = WordList$(i)
  IF Word$ = "" GOTO SkipSingKey
  IF (ASC(Word$) >= 48 AND ASC(Word$) <= 57) GOTO SkipSingKey
  PrefixFlag = FALSE: MeanPrefixFlag = FALSE

  '*** Binary searches thru SingList for a range for first letter
TryAgain:
  Last = 0
  w$ = LCASE$(LEFT$(Word$, 1))
  L = 1: R = NumSing% ' total number of code entries
  DO
    x = (CLNG(L) + R) \ 2
    IF w$ < LCASE$(LEFT$(SingList(x).Str, 1)) THEN
      R = x - 1
    ELSE
      IF w$ > LCASE$(LEFT$(SingList(x).Str, 1)) THEN
        L = x + 1
      ELSE 'if equal, check every other word for first and last
        N = 2
        IF x > 2 THEN
          DO
            IF LCASE$(LEFT$(SingList(x - N).Str, 1)) < w$ THEN
              IF LCASE$(LEFT$(SingList(x - N + 1).Str, 1)) < w$ THEN
                First = x - N + 2
              ELSE
                First = x - N + 1
              END IF
              EXIT DO
            ELSE
              N = N + 2
              IF x - N <= 0 THEN First = 1: EXIT DO
            END IF
          LOOP
        ELSE
          First = 1
        END IF

        IF x <= NumSing - 2 THEN
          N = 2
          DO
            IF LCASE$(LEFT$(SingList(x + N).Str, 1)) > w$ THEN
              IF LCASE$(LEFT$(SingList(x + N - 1).Str, 1)) > w$ THEN
                Last = x + N - 2
              ELSE
                Last = x + N - 1
              END IF
              EXIT DO
            ELSE
              N = N + 2
              IF x + N > NumSing THEN Last = NumSing: EXIT DO
            END IF
          LOOP
        ELSE
          Last = NumSing
        END IF
      END IF
    END IF
  LOOP UNTIL w$ = LCASE$(LEFT$(SingList(x).Str, 1)) OR L > R
  IF Last THEN
    FOR j = Last TO First STEP -1
      CurrKey$ = QPRTrim$(SingList(j).Str)
      Slen = LEN(CurrKey$)
      IF RIGHT$(CurrKey$, 1) = "/" THEN
        Exact = TRUE
        CurrKey$ = LEFT$(CurrKey$, Slen - 1)
        Slen = Slen - 1
      ELSE
        Exact = FALSE
      END IF

      'compare the single keyword [CurrKey$/SingList(j).Str]
      'against the document word [Word$]
      IF Exact THEN ' check for "exact" match
        Match = (CurrKey$ = Word$)
      ELSE ' check for wildcard match
        Match = (CurrKey$ = LEFT$(Word$, Slen))
      END IF

      IF Match THEN ' add the single keyword to the list
        NumKeyWordFound = NumKeyWordFound + 1
        NumKey = NumKey + 1
        REDIM PRESERVE KeyWordFound(1 TO NumKey) AS ListType
        KeyWordFound(NumKey).Code = SingList(j).Code
        KeyWordFound(NumKey).Str = Word$
        KeyWordFound(NumKey).Num = i
        EXIT FOR
      END IF
    NEXT ' key in range
  ELSE

    Letters$ = LEFT$(Word$, 2)
    IF Letters$ = "zs" OR Letters$ = "za" THEN
      SecNum = QPValL&(RIGHT$(Word$, LEN(Word$) - 2))
      IF SecNum > 0 AND SecNum <= 3000 THEN
        IF Letters$ = "zs" THEN
          NumKeyWordFound = NumKeyWordFound + 1
          NumKey = NumKey + 1
          REDIM PRESERVE KeyWordFound(1 TO NumKey) AS ListType
          KeyWordFound(NumKey).Code = SecNum + SecCode '10563
          KeyWordFound(NumKey).Str = "Sec" + STR$(SecNum)
          KeyWordFound(NumKey).Num = i
        ELSE
          NumKeyWordFound = NumKeyWordFound + 1
          NumKey = NumKey + 1
          REDIM PRESERVE KeyWordFound(1 TO NumKey) AS ListType
          IF SecNum <= 30 THEN
            KeyWordFound(NumKey).Code = SecNum + ArtCode '13563
            KeyWordFound(NumKey).Str = "Art" + STR$(SecNum)
            KeyWordFound(NumKey).Num = i
          END IF
        END IF
      END IF
    END IF
  END IF

  IF NOT MeanPrefixFlag THEN
    'check for meaningful prefixes. If found, divide word in two parts
    Word$ = LCASE$(Word$)
    LenW = LEN(Word$)
    FOR NumLet = 14 TO 3 STEP -1
      IF LenW > NumLet + 3 THEN 'should leave at least 3 letters
        IF INSTR(MeanPrefixes$(NumLet), "\" + LEFT$(Word$, NumLet) + "\") THEN
          WordTemp1$ = MID$(Word$, NumLet + 1)
          Word$ = LEFT$(Word$, NumLet)
          MeanPrefixFlag = TRUE
          PrevMatch = Match 'save, because Match will change for the suffix
          EXIT FOR
        END IF
      END IF
    NEXT
    IF MeanPrefixFlag THEN GOTO TryAgain 'check again
  ELSE
    IF WordTemp1$ <> "" THEN
      IF PrevMatch THEN
        Limit = 9
      ELSE
        Limit = 6
      END IF
      Word$ = WordTemp1$
      WordTemp1$ = ""
      IF LEN(Word$) >= Limit THEN GOTO TryAgain
    END IF
  END IF

  'check for meaningless prefixes and delete it
  IF NOT PrefixFlag AND NOT Match THEN 'only one time
    Word$ = LCASE$(Word$)
    LenW = LEN(Word$)
    FOR NumLet = 9 TO 2 STEP -1
      IF LenW > NumLet + 3 THEN 'should leave at least 3 letters
        IF INSTR(Prefixes$(NumLet), "\" + LEFT$(Word$, NumLet) + "\") THEN
          Word$ = MID$(Word$, NumLet + 1)
          PrefixFlag = TRUE
          EXIT FOR
        END IF
      END IF
    NEXT
    IF PrefixFlag THEN
      Limit = 6
      IF LEN(Word$) >= Limit THEN GOTO TryAgain
    END IF
  END IF

SkipSingKey:
NEXT ' word in document
KeyWordFoundNdx(SentNum).Last = NumKey
END SUB

SUB GetKWList (DocNum&, SingList() AS WordCode, CombList() AS WordCode, KeyTemp AS KeyNdx127) STATIC
KeyLen = LEN(KeyTemp)
'EMSGet KeyTemp, KeyLen, DocNum&, KeyEms%
FGetRT KeyEMS, KeyTemp, DocNum&, KeyLen

'get only first 27% keywords and store them with synonyms in EMS
IF KeyTemp.Num > 10 THEN
  KeepNum = KeyTemp.Num * .27
  IF KeepNum < 7 THEN KeepNum = 7
ELSE
  KeepNum = KeyTemp.Num
END IF

NumComb = 0: NumSing = 0 'both are common shared
CALL DictSortSearch(KeyTemp, KeepNum, SingList(), CombList())
END SUB
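
The keyword cut-off rule in GetKWList is easy to misread, so here is a small Python sketch of it. This is illustrative only; `keep_count` is a hypothetical name. It assumes the BASIC assignment of `KeyTemp.Num * .27` to an integer variable rounds to the nearest whole number, which Python's `round` approximates.

```python
def keep_count(num_keywords):
    """How many of a document's keywords are kept for matching.

    Documents with more than 10 keywords keep 27% of them, but never
    fewer than 7; shorter lists are kept whole.
    """
    if num_keywords > 10:
        return max(7, round(num_keywords * 0.27))
    return num_keywords


if __name__ == "__main__":
    for n in (8, 11, 30, 100):
        print(n, keep_count(n))
```

The floor of 7 matters for mid-sized lists: 27% of 11 keywords would otherwise leave only 3.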

FUNCTION GetSentValue! (SentNum, NumWord, Value AS WeightAvgNdx127, KeyTemp AS KeyNdx127) STATIC
'*** value = SUMofUniqueKW(SV*PV^1/8) / SQRT(TotalNumberOfWords)

SentVal! = 0
OldStr$ = ""
FOR i = KeyWordFoundNdx(SentNum).First TO KeyWordFoundNdx(SentNum).Last
  'don't take word which is chopped already
  IF KeyWordFound(i).Num <> 0 THEN
    'don't take the same word twice
    IF INSTR(OldStr$, STR$(KeyWordFound(i).Code)) = 0 THEN
      'find the word
      FOR j = 1 TO KeyTemp.Num
        IF KeyTemp.Code(j) = KeyWordFound(i).Code THEN
          SentVal! = SentVal! + Value.Weight(j)
          OldStr$ = OldStr$ + STR$(KeyTemp.Code(j))
          EXIT FOR
        END IF
      NEXT
    END IF
  END IF
NEXT

GetSentValue! = SentVal! / SQR(NumWord)
END FUNCTION
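
The scoring rule in GetSentValue! can be restated as a short Python sketch. It is an illustration under assumptions, not the listing's code: `found` is a hypothetical list of `(code, position)` pairs standing in for the sentence's KeyWordFound() entries (position 0 meaning the entry was chopped), `weights` maps a keyword code to its weight, and a set replaces the listing's string-based "already seen" check.

```python
import math


def sentence_score(found, weights, num_words):
    """Sum each distinct keyword's weight once, normalised by sqrt(word count).

    found     -- list of (code, position); position 0 marks a chopped entry
    weights   -- dict mapping keyword code to its weight for this document
    num_words -- number of words in the sentence (must be > 0)
    """
    seen = set()
    total = 0.0
    for code, position in found:
        if position == 0:        # already used by an earlier chosen sentence
            continue
        if code in seen:         # count each keyword code only once
            continue
        if code in weights:      # only codes in the document's keyword list score
            total += weights[code]
            seen.add(code)
    return total / math.sqrt(num_words)


if __name__ == "__main__":
    print(sentence_score([(5, 1), (5, 2), (7, 0), (9, 3)], {5: 2.0, 9: 1.0}, 9))
```

Dividing by the square root rather than the word count itself penalises long sentences only mildly, so a long sentence with many distinct keywords can still outrank a short one.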

FUNCTION InstrTbl% (Start%, Source$, Chars$) STATIC
'returns position of the first met char pair from Chars$ in Source$

V% = 0
FOR i = 1 TO LEN(Chars$) STEP 2
  V% = INSTR(Start, Source$, MID$(Chars$, i, 2))
  IF V% > 0 THEN EXIT FOR
NEXT

InstrTbl% = V%
END FUNCTION
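
One subtlety of InstrTbl% is that it walks Chars$ two characters at a time and stops at the first pair found anywhere in the source, in table order. It therefore does not necessarily return the earliest position in the text. The Python sketch below (an illustration; `instr_tbl` is a hypothetical name, positions are 1-based as in BASIC) shows that behaviour.

```python
def instr_tbl(start, source, chars):
    """Return the 1-based position of the first 2-char token from chars
    that occurs in source at or after start, trying tokens in table order.
    Returns 0 when no token is found."""
    for i in range(0, len(chars), 2):
        token = chars[i:i + 2]
        pos = source.find(token, start - 1)   # str.find is 0-based
        if pos >= 0:
            return pos + 1
    return 0


if __name__ == "__main__":
    # ". " is listed before "; ", so its position wins even though
    # "; " occurs earlier in the text.
    print(instr_tbl(1, "a; b. c", ". ; "))
```

Callers that need the earliest delimiter in the text would have to take the minimum over all tokens instead.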

SUB LoadData STATIC
'--- Load in Word Sorted Dictionary directly into EMS (translate Code# to Word$)

EmsPg = EmsGetPFSeg%

DIM DictWrdTemp AS DictType
File$ = LstDir$ + "DICT.Wrd" 'to get SecCode only
IF NOT Exist%(File$) THEN CLS : PRINT File$; " not found.": END
SizeofFile& = FileSize&(File$)
DictWrdNum = SizeofFile& \ LEN(DictWrdTemp)
SecCode = DictWrdNum - 3030 + 1
ArtCode = DictWrdNum - 30 + 1

File$ = LstDir$ + "DICTSC2T.COD"
DIM DictCodeTemp AS DictType

IF NOT Exist%(File$) THEN CLS : PRINT File$; " not found.": END
SizeofFile& = FileSize&(File$)
DictCodeNum = SizeofFile& \ LEN(DictCodeTemp)

NumPages = SizeofFile& \ SixteenK + 2 ' round off to nearest 2 pages

EmsAlloc NumPages, DictCodeH%, File$

' IF EmsError% GOTO EMSErrHandler

Num32kBlocks = SizeofFile& \ ThirtyTwoK&

Leftover& = SizeofFile& - (Num32kBlocks * ThirtyTwoK&)

FOpen File$, DictCodeFILE
FOR i = 1 TO Num32kBlocks + 1

Box0 11, 5, 15, 70, 2, RevAttr
PaintBox0 11, 5, 15, 70, RevAttr

QPrintRC "Loading " + File$ + " block" + STR$(i) + " /" + STR$(Num32kBlocks + 1), 13, 12, RevAttr

'--- map pages of the DictCodeH% memory to the EMS upper mem page frame

FOR j = 1 TO 2
EmsMapMem DictCodeH%, j, (i - 1) * 2 + j
' IF EmsError% GOTO EMSErrHandler
NEXT

'--- seek to beginning of current block
FSeek DictCodeFILE, (i - 1) * ThirtyTwoK&

IF i < Num32kBlocks + 1 THEN

'--- get the 32k block and put it directly into the EMS page frame
FGetA DictCodeFILE, BYVAL EmsPg, BYVAL 0, ThirtyTwoK&

ELSE

'--- load the left over (<32k) bytes

FGetA DictCodeFILE, BYVAL EmsPg, BYVAL 0, Leftover&

END IF

NEXT

FClose DictCodeFILE

CLS

'--- Load weight.Ndx into EMS
File$ = NdxDir$ + "NEWVAL.NDX"
IF NOT Exist%(File$) THEN
CLS

PRINT File$; " not found."

CALL ReleaseEMS

STOP

END IF

FOpenAll File$, 0, 4, ValueEMS
'ValueEMS = LoadIntoEMS(File$)
CLS

'--- Load Key.Ndx into EMS

File$ = NdxDir$ + "NEWKEY.NDX"
IF NOT Exist%(File$) THEN
CLS

PRINT File$; " not found."

CALL ReleaseEMS

STOP

END IF

FOpenAll File$, 0, 4, KeyEMS
'KeyEMS = LoadIntoEMS(File$)
END SUB

FUNCTION LoadIntoEMS (File$) STATIC

'--- Returns the handle where the file was loaded into

EmsPg = EmsGetPFSeg%
SizeofFile& = FileSize&(File$)

NumPages = SizeofFile& \ SixteenK + 2 ' round off to nearest 2 pages
EmsAlloc NumPages, FileEMS, File$

Num32kBlocks = SizeofFile& \ ThirtyTwoK&

Leftover& = SizeofFile& - (Num32kBlocks * ThirtyTwoK&)

FOpenAll File$, 0, 4, LoadFILE

FOR i = 1 TO Num32kBlocks + 1

Box0 11, 5, 15, 70, 2, RevAttr

PaintBox0 11, 5, 15, 70, RevAttr

QPrintRC "Loading " + File$ + " block" + STR$(i) + " /" + STR$(Num32kBlocks + 1), 13, 12, RevAttr
'--- map pages of the memory to the EMS upper mem page frame
FOR j = 1 TO 2

EmsMapMem FileEMS, j, (i - 1) * 2 + j

IF EmsError% THEN PRINT "Ems error:"; EmsError%: STOP

NEXT

'--- seek to beginning of current block

FSeek LoadFILE, (i - 1) * ThirtyTwoK&

IF DOSError% THEN PRINT "Dos Error:"; WhichError%: STOP
IF i < Num32kBlocks + 1 THEN

'--- get the 32k block and put it directly into the EMS page frame
FGetA LoadFILE, BYVAL EmsPg, BYVAL 0, ThirtyTwoK&

IF DOSError% THEN PRINT "Dos Error:"; WhichError%: STOP

ELSE

'--- load the left over (<32k) bytes
FGetA LoadFILE, BYVAL EmsPg, BYVAL 0, Leftover&
IF DOSError% THEN PRINT "Dos Error:"; WhichError%: STOP

END IF

NEXT

FClose LoadFILE
CLS

LoadIntoEMS = FileEMS
END FUNCTION

SUB Rank (BestSent, NoMoreFlag, NumKeyWords(), NumWord(), Value AS WeightAvgNdx127, KeyTemp AS KeyNdx127) STATIC

'this procedure calculates value for each sentence and finds the best one.

REDIM SentValue(1 TO TotSentNum) AS SentValueType
FOR SentNum = 1 TO TotSentNum

IF NumKeyWords(SentNum) > 0 THEN

SentValue(SentNum).Value = GetSentValue!(SentNum, NumWord(SentNum), Value, KeyTemp)

ELSE

SentValue(SentNum).Value = 0

END IF

SentValue(SentNum).Num = SentNum
NEXT

FOR i = 1 TO TotSentNum

IF SentValue(i).Value > M! THEN
Best = i

M! = SentValue(i).Value
END IF

NEXT

IF Best = 0 THEN NoMoreFlag = TRUE: EXIT SUB
BestSent = SentValue(Best).Num
ERASE SentValue
END SUB
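
The selection step in Rank amounts to a running maximum over the sentence scores. The Python sketch below illustrates that step only; it takes the scores as a plain list (an assumption for the example), starts the running maximum at zero on every call, and returns `None` where the listing sets NoMoreFlag.

```python
def rank(sentence_values):
    """Return the 1-based index of the highest-scoring sentence,
    or None when no sentence scores above zero (hypothetical helper)."""
    best = 0
    best_value = 0.0
    for i, value in enumerate(sentence_values, start=1):
        if value > best_value:       # strict '>' keeps the earliest of equal scores
            best = i
            best_value = value
    return best or None


if __name__ == "__main__":
    print(rank([0.0, 1.5, 0.7, 1.5]))
    print(rank([0.0, 0.0]))
```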

SUB ReadEnglishText (FirstLine&, LastLine&, Handle%, TotSentNum) STATIC
DIM Temp1 AS STR80

REDIM SentNdx(1 TO 1) AS SentNdxType
LenTemp = LEN(Temp1) ' = 80

NumLines = 0 ' total number of lines input from file

EndofSentenceS = \ ! ? ; - 

LineNum = 0: SentNum = 1: SentNdx(1).BL = 1: SentNdx(1).BC = 1: s = 1

IF TextNum& > MaxText THEN

FOR i& = FirstLine& TO LastLine&

FGetRT DocFile, Temp1, i&, LenTemp

CALL ReadSection(Temp1.Str, Section$(), Article$())
NumLines = NumLines + 1
EmsSet1El Temp1, LenTemp, NumLines, Handle

NEXT

ELSE

FOR i& = FirstLine& TO LastLine&

FGetRT DocFile, Temp1, i&, LenTemp
NumLines = NumLines + 1
TextArray(NumLines).Str = Temp1.Str
NEXT

Array2Ems TextArray(1), 128, NumLines, Handle
FOR i = 1 TO NumLines
CALL ReadSection(TextArray(i).Str, Section$(), Article$())
NEXT

END IF

'--- Process text
CurrLine = 0
GOSUB GetNextLine
DO

DO

'--- skip blank lines, or if we've gone too far
DO WHILE Start > LENTxt

IF CurrLine = NumLines GOTO EndOfFile
GOSUB GetNextLine

LOOP

p = InstrTbl%(Start, Txt$, EndOfSentence$)
IF p = 0 THEN

IF CurrLine = NumLines GOTO EndOfFile
GOSUB GetNextLine
p = 0

ELSE

Start = p + 2

EndSign$ = MID$(Txt$, p, 1)

END IF

LOOP UNTIL p '--- loop until we've found the end of sentence

IF p > 2 THEN '--- it's an end of sentence
' (i.e., can't have x. as the end of a sentence)

IF MID$(Txt$, p - 2, 1) <> " " THEN '--- 2nd char before end of sentence location is not a space

IF MID$(Txt$, p + 2, 1) = " " THEN '--- we have two spaces after the end of sentence
'--- save potential end of sentence (line & column)
EL = LineNum: EC = p
'--- look for beginning of sentence and
' the first alphanumeric character in that sentence
ch = MidChar(Txt$, p)
FirstFlag = TRUE

DO UNTIL (ch >= 97 AND ch <= 122) OR (ch >= 48 AND ch <= 57) OR (ch >= 65 AND ch <= 90)

p = p + 1
IF ch <> 32 AND ch <> 0 AND FirstFlag THEN BC = p: BL = LineNum: FirstFlag = FALSE
IF p > LENTxt THEN
IF CurrLine = NumLines GOTO EndOfFile
GOSUB GetNextLine

END IF
ch = MidChar(Txt$, p)

LOOP

V (ch ~««H0 ch - «) « n*B*-i™. p ♦ 1) . * «, Wdctardxrt. p ♦ 1) = « 0 

SentNdx(SentNum).EL = EL: SentNdx(SentNum).EC = EC
s = SentNum
SentNum = SentNum + 1

REDIM PRESERVE SentNdx(1 TO SentNum) AS SentNdxType
IF TextNum& > MaxText THEN
MID$(Txt$, p, 1) = LCASE$(MID$(Txt$, p, 1))
Temp1.Str = Txt$
EmsSet1El Temp1, LenTemp, CurrLine, Handle
ELSE

MID$(TextArray(CurrLine).Str, p, 1) = LCASE$(MID$(TextArray(CurrLine).Str, p, 1))

END IF

SentNdx(SentNum).BL = BL: SentNdx(SentNum).BC = BC
Start = p

END IF '--- end check for uppercase letter

'--- search for the next [potential] EOS no matter
' if we found a sentence (above) or not
GOTO NextEnglSearch
END IF '--- two spaces after EOS
END IF '--- not end of line

'--- at this point we have: xx. x
EL = LineNum: EC = p '--- save potential EOS info

'--- this loop extracts the word to the left of the EOS
n = QInstrB(p, Txt$, " ")

'--- extract the previous word
PrevWord$ = LCASE$(MID$(Txt$, n + 1, p - n - 1))

'--- check to see if the previous is among the 8 abbreviations
AbbrevEnglNum = 8

CALL FindExact(VARPTR(AbbrevEngl$(1)), AbbrevEnglNum, PrevWord$)
'--- if we found the abbreviation then this wasn't an EOS
IF AbbrevEnglNum <> -1 THEN GOTO NextEnglSearch

'--- wasn't found, move p to point to the space (after the EOS char)
p = p + 1

IF p >= LENTxt THEN '--- we were already at the end of the line

' so get the next line of text

IF CurrLine = NumLines THEN
GOTO EndOfFile

ELSE

GOSUB GetNextLine

END IF

END IF

'--- [same loop as above]
FirstFlag = TRUE

DO

ch = MidChar(Txt$, p)

IF ch <> 32 AND ch <> 0 AND FirstFlag THEN BC = p: BL = LineNum: FirstFlag = FALSE
p = p + 1

IF p > LENTxt THEN

IF CurrLine = NumLines GOTO EndOfFile
GOSUB GetNextLine

END IF

LOOP UNTIL (ch >= 97 AND ch <= 122) OR (ch >= 48 AND ch <= 57) OR (ch >= 65 AND ch <= 90)
'--- if it's not an uppercase letter, continue searching (this wasn't an EOS)
IF (ch >= 65 AMD ch 90) 08 HidCharCTxtS, p) - 46 OR rHcCr*r(TxtS, p) = 41 OR EndSignS a THEN U 

'--- now we assume that it's an EOS, save info

SentNdx(SentNum).EL = EL: SentNdx(SentNum).EC = EC

s = SentNum

SentNum = SentNum + 1

REDIM PRESERVE SentNdx(1 TO SentNum) AS SentNdxType
IF TextNum& > MaxText THEN
IF p > 1 THEN

MID$(Txt$, p - 1, 1) = LCASE$(MID$(Txt$, p - 1, 1))

ELSE

MID$(Txt$, p, 1) = LCASE$(MID$(Txt$, p, 1))

END IF

Temp1.Str = Txt$

EmsSet1El Temp1, LenTemp, CurrLine, Handle

ELSE

IF p > 1 THEN

MID$(TextArray(CurrLine).Str, p - 1, 1) = LCASE$(MID$(TextArray(CurrLine).Str, p - 1, 1))

ELSE

MID$(TextArray(CurrLine).Str, p, 1) = LCASE$(MID$(TextArray(CurrLine).Str, p, 1))

END IF

END IF

SentNdx(SentNum).BL = BL: SentNdx(SentNum).BC = BC

Start = p

END IF

END IF

NextEnglSearch:

LOOP UNTIL CurrLine >= NumLines

EndOfFile:

SentNdx(SentNum).EL = EL: SentNdx(SentNum).EC = 80

TotSentNum = SentNum

IF TextNum& > MaxText THEN

FOR i = 1 TO ActualLin

EmsGet1El Temp1, LenTemp, i, Handle
Txt$ = QPRTrim$(Temp1.Str)
CALL ChangeChar(Txt$, Keep$)
Temp1.Str = Txt$

EmsSet1El Temp1, LenTemp, i, Handle

NEXT

ELSE

FOR i = 1 TO ActualLin

CALL ChangeChar(TextArray(i).Str, Keep$)

NEXT

END IF
EXIT SUB

GetNextLine:

CurrLine = CurrLine + 1

Start = 1 '--- start scanning at first position

p = 1
LineNum = LineNum + 1
ActualLin = ActualLin + 1
IF TextNum& > MaxText THEN

EmsGet1El Temp1, LenTemp, CurrLine, Handle
Txt$ = QPRTrim$(Temp1.Str) + " "

ELSE

Txt$ = QPRTrim$(TextArray(CurrLine).Str) + " "

END IF

'--- trim down end of line, but make sure there's one space at the end
' so that we can find end of sentences (looking for a DOT & SPACE)
' even if they're at the end of the line.

LENTxt = LEN(Txt$)

RETURN

END SUB

SUB ReadGermanText (FirstLine&, LastLine&, Handle%, Lin%) STATIC

DIM Temp1 AS STR80
REDIM SentNdx(1 TO 1) AS SentNdxType
LenTemp = LEN(Temp1) ' = 80
NumLines = 0 ' total number of lines input from file
ActualLin = 0

EndOfSentenceS ■ ".£?■ * 

LineNum = 0: SentNum = 1: SentNdx(1).BL = 1: SentNdx(1).BC = 1: s = 1
IF TextNum& > MaxText THEN

FOR i& = FirstLine& TO LastLine&

FGetRT DocFile, Temp1, i&, 80

CALL ReadSection(Temp1.Str, Paragraph$(), Artikel$())

NumLines = NumLines + 1

EmsSet1El Temp1, LenTemp, NumLines, Handle

NEXT

ELSE

FOR i& = FirstLine& TO LastLine&

FGetRT DocFile, Temp1, i&, LenTemp

NumLines = NumLines + 1

TextArray(NumLines).Str = Temp1.Str

NEXT

Array2Ems TextArray(1), 128, NumLines, Handle
FOR i = 1 TO NumLines

CALL ReadSection(TextArray(i).Str, Paragraph$(), Artikel$())

NEXT

END IF

CurrLine = 0
GOSUB NextLine
DO

'--- Process text

DO

'--- skip over blank lines, or if we've gone too far
DO WHILE Start > LENTxt

IF CurrLine = NumLines GOTO EndFile

GOSUB NextLine

LOOP

p = InstrTbl%(Start, Txt$, EndOfSentence$)
IF p = 0 THEN

IF CurrLine = NumLines GOTO EndFile

GOSUB NextLine

p = 0

ELSE

Start = p + 2

END IF

LOOP UNTIL p '--- loop until we've found the end of sentence

IF p > 2 THEN '--- it's an end of sentence

' (i.e., can't have x. as the end of a sentence)
'--- extract the previous word
n = QInstrB(p, Txt$, " ")

PrevWord$ = LCASE$(MID$(Txt$, n + 1, p - n - 1))

NOSFlag = FALSE

IF WWjTxtt^p - 2, 1) - - - THEN NOSFlag = TRUE'- 2nd char before end of sentence location is not a 
P it SSSJ!* L~Z 2°!S 8t «* °* We t1ne - «» check the next word/char. 
IF MID$(Txt$, p + 2, 1) = " " THEN '--- we have two spaces after the end of sentence
'--- save potential end of sentence (line & column)
EL = LineNum: EC = p
'--- look for beginning of sentence and
' the first alphanumeric character in that sentence
ch = MidChar(Txt$, p)
FirstFlag = TRUE

DO UNTIL (ch >= 97 AND ch <= 122) OR (ch >= 48 AND ch <= 57) OR (ch >= 65 AND ch <= 90)
p = p + 1

IF ch <> 32 AND ch <> 0 AND FirstFlag THEN BC = p: BL = LineNum: FirstFlag = FALSE
IF p > LENTxt THEN

IF CurrLine = NumLines GOTO EndFile
GOSUB NextLine

END IF
ch = MidChar(Txt$, p)

LOOP

IF ch >= 65 AND ch <= 90 THEN '--- uppercase letter

'--- this is definitely a new sentence, so save location info
SentNdx(SentNum).EL = EL: SentNdx(SentNum).EC = EC
s = SentNum
SentNum = SentNum + 1

REDIM PRESERVE SentNdx(1 TO SentNum) AS SentNdxType
SentNdx(SentNum).BL = BL: SentNdx(SentNum).BC = BC

END IF '--- end check for uppercase letter

'--- search for the next [potential] EOS no matter
' if we found a sentence (above) or not
GOTO NextSearch

END IF '--- two spaces after EOS
END IF '--- not end of line

'--- at this point we have: x. x

EL = LineNum: EC = p '--- save potential EOS info

'--- look for the first alphanumeric character

FirstFlag = TRUE

DO

p = p + 1

ch = MidChar(Txt$, p)

IF ch <> 32 AND ch <> -1 AND ch <> 0 AND FirstFlag THEN BC = p: BL = LineNum: FirstFlag = FALSE
IF p > LENTxt THEN

IF CurrLine = NumLines GOTO EndFile

GOSUB NextLine

END IF

LOOP UNTIL (ch >= 97 AND ch <= 122) OR (ch >= 48 AND ch <= 57) OR (ch >= 65 AND ch <= 90)
'--- if it's not an uppercase letter, continue searching (this wasn't an EOS)
IF ch < 65 OR ch > 90 THEN GOTO NextSearch

'--- extract the word following the EOS
n = p

DO

ch$ = MID$(Txt$, n, 1)
n = n + 1

LOOP UNTIL (ch$ < "A" OR ch$ > "Z") AND (ch$ < "a" OR ch$ > "z") OR n > LENTxt

'--- "following word"

FollWord$ = MID$(Txt$, p, n - p - 1)

'--- check to see if this is one of the 316 noise words

NoiseNum = 316

CALL FindExact(VARPTR(Noise$(1)), NoiseNum, FollWord$)

IF NoiseNum <> -1 THEN '--- it was a noise word, so we know it's an EOS

SentNdx(SentNum).EL = EL: SentNdx(SentNum).EC = EC

s = SentNum

SentNum = SentNum + 1

REDIM PRESERVE SentNdx(1 TO SentNum) AS SentNdxType
SentNdx(SentNum).BL = BL: SentNdx(SentNum).BC = BC
Start = p
GOTO NextSearch

END IF

IF NOSFlag THEN GOTO NextSearch

'--- check to see if the previous is among the 15 abbreviations
Abbrev1Num = 15

CALL FindExact(VARPTR(Abbrev1$(1)), Abbrev1Num, PrevWord$)
'--- if we found the abbreviation then this wasn't an EOS
IF Abbrev1Num <> -1 THEN GOTO NextSearch

'--- wasn't found, move p to point to the space (after the EOS char)

'--- check to see if length < 6 (which would not be a new sentence)
TrySent$ = ""

IF SentNdx(SentNum).BL = LineNum THEN '--- sentence starts on current line

TrySent$ = MID$(Txt$, SentNdx(SentNum).BC, EC - SentNdx(SentNum).BC + 1)

ELSE

IF SentNdx(SentNum).BL = LineNum - 1 THEN

TrySent$ = RIGHT$(PrevLine$, LEN(PrevLine$) - SentNdx(SentNum).BC + 1) + LEFT$(Txt$, EC)
END IF

END IF

TLen = LEN(TrySent$)
IF TLen THEN

CALL CRUNCH(TrySent$, " ", TLen)

TrySent$ = LEFT$(TrySent$, TLen)

Num = InCount(TrySent$, " ") + 1
IF Num < 6 THEN GOTO NextSearch '--- sentence has less than 6 words, so it wasn't an EOS

END IF

'--- now we assume that it's an EOS, save info
SentNdx(SentNum).EL = EL: SentNdx(SentNum).EC = EC
s = SentNum
SentNum = SentNum + 1

REDIM PRESERVE SentNdx(1 TO SentNum) AS SentNdxType

SentNdx(SentNum).BL = BL: SentNdx(SentNum).BC = BC

Start = p

END IF

NextSearch:

LOOP UNTIL CurrLine >= NumLines

EndFile:

SentNdx(SentNum).EL = TextNum&: SentNdx(SentNum).EC = 80

TotSentNum = SentNum

IF TextNum& > MaxText THEN

FOR i = 1 TO ActualLin

EmsGet1El Temp1, LenTemp, i, Handle
CALL Lower(Temp1.Str)
Txt$ = QPRTrim$(Temp1.Str)
CALL ChangeChar(Txt$, Keep$)
Temp1.Str = Txt$
EmsSet1El Temp1, LenTemp, i, Handle
NEXT

ELSE

FOR i = 1 TO ActualLin

CALL Lower(TextArray(i).Str)
CALL ChangeChar(TextArray(i).Str, Keep$)
NEXT

END IF
EXIT SUB

NextLine:

CurrLine = CurrLine + 1
Start = 1 '--- start scanning at first position
p = 1

LineNum = LineNum + 1
ActualLin = ActualLin + 1

'--- save the previous line for checking length of sentence
' (in the cases where a sentence crosses the line boundary)
PrevLine$ = Txt$

IF TextNum& > MaxText THEN

EmsGet1El Temp1, LenTemp, CurrLine, Handle
Txt$ = QPRTrim$(Temp1.Str) + " "

ELSE

Txt$ = QPRTrim$(TextArray(CurrLine).Str) + " "

END IF

'--- trim down end of line, but make sure there's one space at the end
' so that we can find end of sentences (looking for a DOT & SPACE)
' even if they're at the end of the line.
LENTxt = LEN(Txt$)

RETURN

END SUB

SUB ReadSection (Txt$, SecArray$(), ArtArray$()) STATIC
'--- Look for "sections" or "articles"

IF Lang$ = "GERMAN" THEN

'--- In GERMAN it's Par
SearchStr$ = "Par"

ELSE

'--- in English it's Sec
SearchStr$ = "Sec"

END IF

Letter$ = "is"

FOR LookStep = 1 TO 2

Start = 1

DO

n = INSTR(Start, Txt$, SearchStr$) '--- column of start of Sec. or Art.
IF n THEN

j = INSTR(n, Txt$, " ") '--- position of the end of the word

IF j THEN '--- if this is not a last word

Word$ = MID$(Txt$, n, j - n) '--- get the whole word

ELSE

EXIT DO '--- this was the last word, so exit

END IF

'--- check if the word matches variations on Section or Article
IF Lang$ = "GERMAN" AND LookStep = 1 THEN

ELSE

NumFound = 3 '--- there are three variations that we check for

END IF

IF LookStep = 1 THEN

CALL FindExact(VARPTR(SecArray$(1)), NumFound, Word$)

ELSE

CALL FindExact(VARPTR(ArtArray$(1)), NumFound, Word$)

END IF

IF NumFound <> -1 THEN '--- it did match, so check the number
k = j + 1 '--- starting position of [potential] number

DO '--- skip over blank spaces
ch = MidChar(Txt$, k)

k = k + 1

LOOP UNTIL ch <> 32 OR k > LEN(Txt$)

m1 = 0

DO '--- collect the whole number

ch = MidChar(Txt$, k + m1 - 1)

m1 = m1 + 1

IF k + m1 - 1 > LEN(RTRIM$(Txt$)) THEN EXIT DO
LOOP UNTIL ch < 48 OR ch > 57

IF m1 > 1 THEN '--- there is a number

Numb$ = MID$(Txt$, k - 1, m1 - 1)

IF QPValL&(Numb$) <= 3000 AND QPValL&(Numb$) > 0 THEN

'--- if we're looking for Article numbers, don't accept
' article numbers over 30
IF LookStep = 2 AND QPValL&(Numb$) > 30 THEN GOTO NextStep
NewWord$ = Letter$ + Numb$
m1 = INSTR(k, Txt$, " ")
IF m1 = 0 THEN

m1 = LEN(Txt$)

ELSE

NewWord$ = NewWord$ + STRING$(m1 - n - LEN(NewWord$), 32)

END IF

Txt$ = LEFT$(Txt$, n - 1) + NewWord$ + MID$(Txt$, m1)

END IF

ELSE

IF SearchStr$ = "Art" THEN
m1 = 0
'--- loop while it's a Roman numeral and we're not
' past the end of the string
DO

ch$ = MID$(Txt$, k + m1 - 1, 1)
m1 = m1 + 1

IF k + m1 - 1 > LEN(RTRIM$(Txt$)) THEN EXIT DO
LOOP WHILE INSTR("IVX", ch$)

Numb$ = MID$(Txt$, k - 1, m1 - 1)

'--- translate the Roman numeral(s) to Arabic numerals
NumFound = 30 '--- there are 30 Roman numbers to check
CALL FindExact(VARPTR(Numbers$(1)), NumFound, Numb$)

IF NumFound <> -1 THEN

NewWord$ = "za" + LTRIM$(STR$(NumFound + 1))
m1 = INSTR(k, Txt$, " ")
IF m1 = 0 THEN

Txt$ = LEFT$(Txt$, n - 1) + NewWord$

ELSE

NewWord$ = NewWord$ + STRING$(m1 - n - LEN(NewWord$), 32)

Txt$ = LEFT$(Txt$, n - 1) + NewWord$ + RIGHT$(Txt$, LEN(Txt$) - m1 + 1)

END IF

END IF

END IF '--- are we searching for an Article #?
END IF '--- there's a number after the Section/Article
END IF '--- did we find a variation of Section or Article?

END IF '--- INSTR(Text, "Sec.") was found

NextStep:

Start = n + 1

LOOP UNTIL n = 0

SearchStr$ = "Art"
Letter$ = "za"

NEXT

'--- start looking at the beginning of the line
Start = 1

DO

'--- look for the section symbol
n = INSTR(Start, Txt$, CHR$(21))

'--- if we found one, process it
IF n THEN

'--- position right after the symbol
k = n + 1
m1 = 0

'--- loop until it's not a number (a space is ok, however)
' or we've reached the end of the string
DO

ch = MidChar(Txt$, k + m1)
m1 = m1 + 1

IF k + m1 - 1 > LEN(RTRIM$(Txt$)) THEN EXIT DO
LOOP UNTIL (ch < Asc0 OR ch > Asc9) AND ch <> 32

'--- the number is the position from right after the symbol (k)
' to the non-number position found in the loop above (m1 - 1)

Numb$ = QPTrim$(MID$(Txt$, k, m1 - 1))

IF QPValL&(Numb$) <= 3000 AND QPValL&(Numb$) > 0 THEN

NewWord$ = "is" + Numb$
m1 = INSTR(k + 1, Txt$, " ")

IF m1 THEN

Txt$ = LEFT$(Txt$, n - 1) + NewWord$ + MID$(Txt$, m1)

ELSE

Txt$ = LEFT$(Txt$, n - 1) + NewWord$

END IF

END IF

'--- start looking at the next position
Start = n + 2

END IF

'--- loop until we don't find any more section symbols
LOOP UNTIL n = 0

END SUB
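
ReadSection replaces a Roman article number with a compact token by looking it up in a 30-entry table and prefixing the 1-based position with "za". The Python sketch below illustrates that lookup only. The table contents are an assumption (the listing never shows Numbers$()), generated here as the numerals I through XXX, and the function names are hypothetical.

```python
def to_roman(n: int) -> str:
    """Roman numeral for 1 <= n <= 39 (enough for the 30-entry table)."""
    pairs = [(10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, symbol in pairs:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


# Assumed contents of the lookup table: I, II, ..., XXX.
ROMAN_NUMBERS = [to_roman(n) for n in range(1, 31)]


def article_token(numeral: str):
    """Return the "za<number>" token for a Roman article number,
    or None when the numeral is not in the table."""
    try:
        index = ROMAN_NUMBERS.index(numeral)   # 0-based position in the table
    except ValueError:
        return None
    return "za" + str(index + 1)


if __name__ == "__main__":
    for numeral in ("I", "XIV", "XXX", "IIII"):
        print(numeral, article_token(numeral))
```

An exact-match table rejects malformed numerals such as "IIII" without any parsing logic, which fits the listing's limit of 30 article numbers.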



SUB ReleaseEMS STATIC

IF KeyEMS THEN EmsRelMem KeyEMS
IF ValueEMS THEN EmsRelMem ValueEMS
IF DictCodeH THEN EmsRelMem DictCodeH
END SUB

SUB WordParse (Sent$, LenSen, SentNum, WordList$(), Words%) STATIC



'--- Parse Sentence into Words

'--- Words = number of words parsed

Words = 0: Slen = 0: Start = 1
DO UNTIL Start >= LenSen
Slen = 0

FOR i = Start TO LenSen

ch$ = MID$(Sent$, i, 1)
IF ch$ = " " THEN
EXIT FOR

ELSE

Slen = Slen + 1

END IF

NEXT

IF Slen > 0 THEN
'--- extract word

w$ = MID$(Sent$, Start, Slen)

'--- fill out 1 and 2 char words with /'s

IF Slen < 3 THEN w$ = w$ + STRING$(3 - Slen, "/")

'--- allow only words that start with alphabetic chars "a"-"z" or 1-9
ASCw = ASC(w$)

IF (ASCw >= 97 AND ASCw <= 122) OR (ASCw >= 65 AND ASCw <= 90) THEN
Words = Words + 1

'--- the following doesn't apply to GERMAN
IF Lang$ = "ENGLISH" THEN

IF RIGHT$(w$, 2) = "'s" THEN '--- remove the 's

w$ = LEFT$(w$, Slen - 2)
ELSEIF RIGHT$(w$, 1) = "'" THEN '--- and any final '
w$ = LEFT$(w$, Slen - 1)

END IF

END IF

'--- store the word

REDIM PRESERVE WordList$(1 TO Words)
WordList$(Words) = w$
END IF

END IF
Start = Start + Slen + 1
LOOP 'next word in sentence

END SUB
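
The tokenising rules in WordParse can be summarised in a short Python sketch. It is an approximation for illustration only: splitting on single spaces, padding one- and two-character words to three characters, keeping only words that start with a letter, and stripping a trailing possessive in English. The pad character "/" and the possessive literals are read from the damaged listing and should be treated as assumptions; the function name is hypothetical.

```python
def word_parse(sentence: str, language: str = "ENGLISH") -> list:
    """Split a sentence into words following the WordParse rules (sketch)."""
    words = []
    for token in sentence.split(" "):
        if not token:                     # consecutive spaces yield empty tokens
            continue
        token = token.ljust(3, "/")       # pad 1- and 2-char words (assumed pad char)
        first = token[0]
        if not (first.isascii() and first.isalpha()):
            continue                      # only words starting with A-Z or a-z
        if language == "ENGLISH":
            if token.endswith("'s"):      # drop possessive 's
                token = token[:-2]
            elif token.endswith("'"):     # drop a bare trailing apostrophe
                token = token[:-1]
        words.append(token)
    return words


if __name__ == "__main__":
    print(word_parse("The dog's 3 toys at it"))
```

Padding short words keeps every stored word at least three characters long, presumably so that later fixed-width lookups treat "at" and "it" consistently.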

SUB WriteSentence (Sent$, Best, AbstrPos&) STATIC

' this procedure wraps sentence, saves highlighting information and writes
' sentence to the disk. Marking EOS - add 100 to the position of the first
' highlighted word.

Wid = 78: OldStr$ = ""

'create array of words which should be highlighted
n = 0

'move keywords into array

NumCodes = KeyWordFoundNdx(Best).Last - KeyWordFoundNdx(Best).First + 1
REDIM KeyWords$(1 TO NumCodes)

FOR i = KeyWordFoundNdx(Best).First TO KeyWordFoundNdx(Best).Last

' to not highlight word, appeared in the previous sentence take out this
' comments.
IF KeyWordFound(i).Num <> 0 THEN

IF INSTR(OldStr$, RTRIM$(KeyWordFound(i).Str) + " ") = 0 THEN
n = n + 1

KeyWords$(n) = RTRIM$(KeyWordFound(i).Str)
OldStr$ = OldStr$ + KeyWords$(n) + " "

END IF

END IF

NEXT

REDIM PRESERVE KeyWords$(1 TO n)



'collecting highlighting information
SentClean$ = Sent$

CALL ChangeCharCSentC leans, .KeepBoNuaberaS) 
IF Lang$ = "GERMAN" THEN CALL Lower(SentClean$)
NumHigh = 0

REDIM Highligt(1 TO NumCodes)
REDIM LenHigh(1 TO NumCodes)
FirstTime = TRUE
FOR i = 1 TO UBOUND(KeyWords$)

Start ° 1 

DO 

NuaHigh ■ NuaHigh ♦ 1 
IF LangS » "GERMAN- THEN 

Highligt (NuaHigh) - INSTR(Stort, SentCleanS, KeyWordsS(l) * - ») 

ELSE 

IF FirstTiae THEN ^ ttx 
Highligt (NuaHigh) = OSTR< SentCleanS, UUSESCtf FTSCKeyUordsS(i), 1)) ♦ HIW(1CeyWordsS<i), 

END IF 

HighUgt<NuBHigh) • INSTRC Start, SentCleanS, <eyWordsS(1) ♦ - •) 

END IF 

EMD IF 

LenHighCNuaHigh) * I^OCeyWordsS(i)) 
•if still not found try add ■'a" 
IF Highligt (NuaHigh) a q THEN 

Spue = INSTRCxaywordsS(i). ■ ■) 

IF SpLOC THEN 

KWJ a LEFTS (KeywordsSd), SpLoc r 1) ♦ "'f - NIDSOCeyWordsS(i), SpLoc) 

E1SE 

KWS » KeyVordsS(i) ♦ -'s* 
END IF

HighUgt<fcaHigh> a IN3TR< Starr, sentcteenS, KUS) 
LenHigh(ttuflHigh) - LENCKU3) 

EJiD IF 

F1rstT1oe = FALSE 

If HighllgtCNusHlgh) <> o THEN 

If HIWCSentaeanS ' Highligt(NunHigh), LehHic^CNuaHigh)) - STRINWCLenHighCNAmHigh). 32) 

Start ■ HighllgtCNuaMIgh) + Lehigh CBuaHigh) 
LOOP UNTIL HighligtCNunHlgh) s 0 OR NuaHigh = NusCodes Ofl Start *= LENCSentS) 
IF HigrligxCKunHigh) « 0 THEN NumHigh = NunHigh - 1 
IF NunHigh ■ NuoCodes THEN EXIT FOR 

NEXT 

'wrapping

DIM AbstrLine AS AbstrType
DO

IF LEN(Sent$) > Wid THEN

LastSpc = QInstrB(Wid + 1, Sent$, " ")
IF LastSpc <> 0 THEN

Text$ = RTRIM$(LEFT$(Sent$, LastSpc))

'remove portion of string that's been moved to the Text$
Sent$ = MID$(Sent$, LastSpc + 1)
j = 0

'save highlighting information
FOR i = 1 TO NumHigh

IF Highligt(i) <> 0 THEN

IF Highligt(i) < LastSpc THEN 'save it
j = j + 1

IF j > 5 THEN EXIT FOR

AbstrLine.Word(j) = CHR$(Highligt(i))

IF Highligt(i) + LenHigh(i) > LastSpc THEN

'probably it is comb. kw which is wrapped
AbstrLine.Lenth(j) = CHR$(LastSpc - Highligt(i))
LenHigh(i) = LenHigh(i) - (LastSpc - Highligt(i) + 1)
Highligt(i) = 1

ELSE

AbstrLine.Lenth(j) = CHR$(LenHigh(i))
Highligt(i) = 0

END IF

ELSE

Highligt(i) = Highligt(i) - LastSpc 'subtract the length of line

END IF

END IF

NEXT

ELSE

Text$ = RTRIM$(LEFT$(Sent$, Wid))
Sent$ = MID$(Sent$, Wid + 1)

END IF

FOR k = j + 1 TO 5

AbstrLine.Word(k) = CHR$(0)
AbstrLine.Lenth(k) = CHR$(0)

NEXT

IF LEN(Sent$) = 0 THEN
'mark EOS

AbstrLine.Word(1) = CHR$(ASC(AbstrLine.Word(1)) + 100)

END IF

AbstrLine.Str = Text$
'PRINT #1, Text$

FPutRT AbstrFile, AbstrLine, AbstrPos&, 68
CALL DosErrHandler
AbstrPos& = AbstrPos& + 1

END IF

LOOP WHILE LEN(Sent$) > Wid
IF LEN(Sent$) <> 0 THEN

Text$ = RTRIM$(Sent$)

j = 0

'save highlighting information

FOR i = 1 TO NumHigh

IF Highligt(i) <> 0 THEN
j = j + 1

IF j > 5 THEN EXIT FOR
AbstrLine.Word(j) = CHR$(Highligt(i))
AbstrLine.Lenth(j) = CHR$(LenHigh(i))

END IF

NEXT

FOR k = j + 1 TO 5

AbstrLine.Word(k) = CHR$(0)
AbstrLine.Lenth(k) = CHR$(0)

NEXT

AbstrLine.Word(1) = CHR$(ASC(AbstrLine.Word(1)) + 100)

'PRINT #1, Text$

AbstrLine.Str = Text$

FPutRT AbstrFile, AbstrLine, AbstrPos&, 68

CALL DosErrHandler

AbstrPos& = AbstrPos& + 1

END IF

END SUB

'this program sanitizes both english and german abstract

'If you are going to try to understand this program then
'you need at least one bottle of vodka.
' I feel sorry for you. Go ahead.

DEFINT A-Z

TYPE AbstrSanType

Str AS STRING * 78

Rest AS STRING * 6

END TYPE

CONST FALSE = 0, TRUE = NOT FALSE, Sing = 0, Comb = NOT Sing

'$INCLUDE: '\user\include\types.bi'
'$INCLUDE: '\user\include\declares.bi'

DECLARE SUB FindSingKey (Word$, KeyEMS%, NewSentFlag, SingFlag, Limit)

DECLARE SUB FindCombKey (WordPhrase$(), KeyEMS%, CombFlag)

DECLARE SUB EmsAlloc (NumPages%, Handle%, LoadFILE$)

DECLARE SUB InsertStr (BYVAL Address%, Inserted$, Size%)

DECLARE SUB Config ()

DECLARE SUB LoadData ()

DECLARE SUB Wrapping (Sent$, Text$(), NumLines%)
DECLARE FUNCTION LoadIntoEMS% (File$)

DECLARE FUNCTION FirstLast% (Word$, First%, Last%, KeyType%)
DECLARE FUNCTION NotInstr% (Start%, Searched$, Table$)

COMMON SHARED Fg, Bg, Brer, LstDir$, DocDir$, NdxDir$, AbstrDir$, Lang$

COMMON SHARED SixteenK, SixtyFour, ThirtyTwo, ThirtyTwoK&

COMMON SHARED Beg, Fin, Machine$, Noise$(), SingKeywordEMS%, CombKeyWordEMS%

COMMON SHARED SingTable%(), CombTable%(), XLoteTable%(), NextStart, Articles$

X $ a mm ■ , 

CALL Config
CALL LoadData

FileName$ = AbstrDir$ + ".NDX"
FOpenAll FileName$, 2, 4, AbstrNdxFile
FileName$ = AbstrDir$ + ".txt"
FOpenAll FileName$, 2, 4, AbstrFile
'File$ = AbstrDir$ + ".TXT" + Machine$
'OPEN File$ FOR OUTPUT AS #1
DIM AbstrNdx AS ISAMtype
LenNdx = LEN(AbstrNdx)
DIM Txt AS AbstrSanType
LenTxt = LEN(Txt)

NoFirstS = CHRSCO) ♦ " '<«" ♦ CHWC34) 
UpTable$ = "QWERTYUIOPLKJHGFDSAZXCVBNM"

REDIM GermanNotName$(3 TO 9)
FOR i = 3 TO 9

READ GermanNotName$(i)

NEXT

REDIM EnglishNotName$(2 TO 6)
FOR i = 2 TO 6

READ EnglishNotName$(i)

NEXT

GermanData:

DATA -\1on\l»^g\-,•\he1t\nz1p\saxz\keit\^,"\igllen\nz1p5\UlIgen\w1ele\ , • 
DATA "\koaten\si^ft\heTten\lceiten\«ant^ 
DATA •\schaften\■ehreren\•,"\pfl^chten\ ,, • 

EnglishData: 

DATA "\el\", B \ion\ity\aU\"»"\aent\ienaAei>cy\e*»c^^ 
DATA °\«ental\eneies\Brcie3\ - 

REDIM GermanName$(1 TO 2)
FOR i = 1 TO 2

READ GermanName$(i)

NEXT

DATA •\er\cu\", B \>am\i«rg\berg\" 

REDIM EnglishName$(1 TO 4)
FOR i = 1 TO 4

READ EnglishName$(i)

NEXT

DATA "\o\i \" ,-\er\rg\os\ey\tz\" , "\aan\son\0ng\haa\t0n\5en\- , "\aartn\te1n\- 



a$ = "\ein\eine\einem\einen\eines\diese\dieses\dieser\diesen\diesem\dieser"
b$ = "\kein\keine\keiner\keinen\keinem\keines\"
eng$ = "\a\an\any\this\such\no\several\many\"
IF Lang$ = "GERMAN" THEN

Articles1$ = a$ + b$

Articles2$ = "\der\die\das\den\dem\des\"

ELSE

Articles1$ = eng$
Articles2$ = "\the\"

END IF

IF Lang$ = "GERMAN" THEN

SectionLine$ = "\Artikel\Art\Art.\Paragraph\Par\Par.\"

ELSE

SectionLine$ = "\Article\Articles\Art\Art.\Section\Sections\Sec\Sec.\Title\"
SectionLineS «* SectionLineS ♦ ■Titles\Par^aph\Subparegraph\Chapter\Chapt5-3 ■." 

END IF 

'********** ****** START ***********************

CLS

FOR DocNum& = Beg TO Fin
FreSp& = FRE("")

QPrintRC "Document " + STR$(DocNum&), 15, 23, -1
AbstrLine$ = ""

FGetRT AbstrNdxFile, AbstrNdx, DocNum&, LenNdx
NewSentFlag = TRUE

NumberOfLines& = AbstrNdx.Last - AbstrNdx.First + 1
'. IF NuaberOf Lines* > : AkD NuaberOfLinesS < 300 THEN 

REDIM Highlight$(1 TO NumberOfLines&) 'highlighting information, just
i = 1 ' temporary keep it.

FOR LineNum& = AbstrNdx.First TO AbstrNdx.Last

FGetRT AbstrFile, Txt, LineNum&, LenTxt
AbstrLine$ = AbstrLine$ + Txt.Str + " "
Highlight$(i) = Txt.Rest
i = i + 1

PRINT Txt.Str
NEXT

Start = 1
DO

CombFlag = FALSE: SingFlag = FALSE: SectionFlag = FALSE: HyphenFlag = FALSE

UpCaseLet = InstrTbl(Start, AbstrLine$, UpTable$)

IF UpCaseLet > 0 THEN 'is there upper-case word?

IF UpCaseLet <= 2 THEN ' the first word in the abstract?
NewSentFlag = TRUE

ELSE

'check if it is preceded by '.' or ':' or ')' (i.e. if it's a new sentence)
PrevPos = 2

ch$ = MID$(AbstrLine$, UpCaseLet - 2, 1)

00 WHILE (ChS < -a" Oft chS > -Z-) AND (chS < -A" Ot enS > "Z") AND InstrTbl2C1 
PrevPos = PrevPos + 1

ch$ = MID$(AbstrLine$, UpCaseLet - PrevPos, 1)

LOOP

IF InstrTbl2(1, ch$, ".:)") THEN
NewSentFlag = TRUE

ELSE

NewSentFlag = FALSE

END IF

END IF

'extract the upper-case word

LastLet = INSTR(UpCaseLet, AbstrLine$, " ")

FirstWord$ = MID$(AbstrLine$, UpCaseLet, LastLet - UpCaseLet)
'take out commas, quotes etc.
ch$ = RIGHT$(FirstWord$, 1)

DO WHILE (ch$ < "a" OR ch$ > "z") AND (ch$ < "A" OR ch$ > "Z")
IF LEN(FirstWord$) > 1 THEN

FirstWord$ = LEFT$(FirstWord$, LEN(FirstWord$) - 1)
ch$ = RIGHT$(FirstWord$, 1)

ELSE

EXIT DO

END IF

LOOP

'if it is UU-U don't sanitize it 
HyphenLoc " INSTRC FirstUordS, "-") 
IF HyphenLoc THEN 

IF HIDSCFirstVordS, HyphenLoc ♦ 1, 1) w - a " AND HIKC FirstUordS, HyphenLoc ♦ 1, 1) .<* THEN 
HyphenFleg ■ TRUE 

END IF 

END IF 

IF NOT HyphenFlag THEN

'compare word with noise word list if it a new sentence only
IF NewSentFlag THEN

NumNoise = UBOUND(Noise$)

CALL FindExact(BYVAL VARPTR(Noise$(1)), NumNoise, FirstWord$)

ELSE

NumNoise = -1

END IF

IF NuaNoise a -1 THEN 'it is not a noise word, cheek for the 

REDIN UordPhrase$(1 TO 5) 'coobincd kw 

UbrdPhraseSCI) a LCASESC FirstWordS) 

UPos a NotlnstrCLastLat * 1, AbstrLineS. No?- <■»-$)" 

IF UPos > 0 THEN 

i = 1 

DO 

LastLet = WSTRCUPes, AbstrLineS, • "> 

NextuordS * HIDSCAbstrLineS, UPos, -SSTLet - UPos) ' 

chS a BIGHTSCftextUordS, 1) 

DO WHILE (chS « "a" OR chS > m z*y as; (chS < "A* OR chS > "Z") 
IF LENCNextUordS) > 1 THEN 

NextuordS a LEFTSCSe*tUbndS, LENCNextUordS) - 1) 
ens » RiarrscNe*ti^-ss, i> 

ELSE 

EXIT 00 

— - , 

IF LENCNextUordS) > 1 THEN 

VordPhraseSCi ♦ 1) » LCASESCHextwordS) 
1 a 1 + 1 

END IF 

UPos a HotlnstrOastlet ♦ 1, Abstr.ineS, NoFirstS) 
LOOP UNTIL i 3 4 Oft UPOS = 0 

FindCoabKey UordPhraseSC), CcabKeyUcrdENSX, CosbFlag 
IF NOT CombFlag THEN 'not a combined kw - check further
IF NOT NewSentFlag THEN
'extract previous word
Prev = QInstrB(UpCaseLet - PrevPos, AbstrLine$, " ")
PrevWord$ = MID$(AbstrLine$, Prev + 1, UpCaseLet - 2 - Prev)
'take out commas, quotes etc.
ch$ = LEFT$(PrevWord$, 1)
DO WHILE (ch$ < "a" OR ch$ > "z") AND (ch$ < "A" OR ch$ > "Z")
IF LEN(PrevWord$) > 1 THEN
PrevWord$ = MID$(PrevWord$, 2)
ch$ = LEFT$(PrevWord$, 1)
ELSE
EXIT DO
END IF
LOOP

'check for comb. kw, beginning from previous word
FOR i = 1 TO 4
WordPhrase$(i + 1) = WordPhrase$(i)
NEXT
WordPhrase$(1) = LCASE$(PrevWord$)
FindCombKey WordPhrase$(), CombKeyWordEMS%, CombFlag
END IF 'not NewSentFlag
END IF 'NOT CombFlag with previous word
END IF 'WPos - the last word in the abstract
IF NOT CombFlag THEN 'still not a combined kw - check further
'check against "Section", etc.
IF INSTR(SectionLine$, "\" + FirstWord$ + "\") THEN
SectionFlag = TRUE
END IF

IF NOT SectionFlag THEN
'check for ending of the word against not-name ending
NotNameEndFlag = FALSE
IF Lang$ = "GERMAN" THEN
FOR i = 3 TO 9
IF INSTR(GermanNotName$(i), "\" + RIGHT$(FirstWord$, i) + "\") THEN
NotNameEndFlag = TRUE
FindSingKey FirstWord$, SingKeywordEMS%, NewSentFlag,
EXIT FOR
END IF
NEXT
ELSE 'english
FOR i = 2 TO 6
IF INSTR(EnglishNotName$(i), "\" + RIGHT$(FirstWord$, i) + "\") THEN
NotNameEndFlag = TRUE
FindSingKey FirstWord$, SingKeywordEMS%, NewSentFlag,
EXIT FOR
END IF
NEXT
END IF

IF NOT NotNameEndFlag THEN
'ending of the word was not found, check further
'check if the previous word is an article
IF NOT NewSentFlag THEN
Prev = QInstrB(UpCaseLet - PrevPos, AbstrLine$, " ")
PrevWord$ = MID$(AbstrLine$, Prev + 1, UpCaseLet - 2 - Prev)
PrevWord$ = "\" + PrevWord$ + "\"
IF INSTR(Articles1$, LCASE$(PrevWord$)) THEN
FindSingKey FirstWord$, SingKeywordEMS%, NewSentFlag,
END IF
END IF

IF NOT SingFlag THEN
'no articles - check further
'check for ending of the word against name ending
NameEndFlag = FALSE
IF Lang$ = "GERMAN" THEN
FOR i = 1 TO 2
IF INSTR(GermanName$(i), "\" + RIGHT$(FirstWo
NameEndFlag = TRUE
FindSingKey FirstWord$, SingKeywordEM
EXIT FOR
END IF
NEXT
ELSE
FOR i = 1 TO 4
IF INSTR(EnglishName$(i), "\" + RIGHT$(FirstW
NameEndFlag = TRUE
FindSingKey FirstWord$, SingKeywordEM
EXIT FOR
END IF
NEXT
END IF

IF NOT NameEndFlag THEN
'ending of the word was not found, check further
'check if the previous word is an article
IF NOT NewSentFlag THEN
IF INSTR(Articles2$, LCASE$(PrevWord$)) THEN
IF Lang$ = "GERMAN" THEN
FindSingKey FirstWord$, SingKe
ELSE
FindSingKey FirstWord$, SingKe
END IF
END IF

IF NOT SingFlag THEN
'no articles? check further
IF LEN(FirstWord$) >= 8 THEN 'no sense to
FindSingKey FirstWord$, SingKeywordEM
END IF
IF NOT SingFlag THEN GOSUB ChangeWord
END IF

ELSE 'if NameEndFlag = TRUE
IF NOT SingFlag THEN GOSUB ChangeWord
END IF 'NameEndFlag
END IF 'SingFlag
ELSE 'if NotNameEndFlag = TRUE
IF NOT SingFlag THEN GOSUB ChangeWord
END IF 'NotNameEndFlag
END IF 'SectionFlag
END IF 'NOT CombFlag without previous word
END IF 'not a noise word
END IF 'not HyphenFlag
END IF 'UpCaseLet > 0

IF CombFlag THEN
Start = UpCaseLet + NextStart
ELSE
Start = UpCaseLet + LEN(FirstWord$)
END IF

LOOP UNTIL UpCaseLet = 0
REDIM Text$(1 TO 1)
CALL Wrapping(AbstrLine$, Text$(), NumLines)
FOR i = 1 TO NumberOfLines%
IF INSTR(Text$(i), "XX") THEN 'this line was
Txt.Str = Text$(i)
Txt.Rest = Highlight$(i)






' PRINT #1, Txt.Str
FPutRT AbstrFile, Txt, CLNG(AbstrNdx.First + i - 1), LenTxt
END IF
NEXT
END IF 'NumberOfLines% valid range
q$ = INKEY$
IF q$ = CHR$(27) THEN EXIT FOR
NEXT 'document
FClose AbstrFile

END

ChangeWord:

'not a single kw - change it!

IF RIGHTS CFIrstiiOrsS, 1) a THEN FlrstUordS = LEFTSCFIrstVordS, LEN(Ftrst-Cr3J) - 1) 
MID$(AbstrLine$, UpCaseLet, LEN(FirstWord$)) = STRING$(LEN(FirstWord$), 88)

RETURN 

SUB Config STATIC
    Cmd$ = COMMAND$
    Parms = InCount(Cmd$, " ") + 1 '-- number of parameters
    IF Parms = 4 THEN
        ' Expected information on command line:
        ' Config file, First Doc, Last Doc
        Extract Cmd$, " ", 1, Strt, SLen '-- extract first parm
        DBName$ = MID$(Cmd$, Strt, SLen)
        ConfigFile$ = DBName$ + ".CFG"

        Extract Cmd$, " ", 2, Strt, SLen '-- extract second parm
        Machine$ = MID$(Cmd$, Strt, SLen)

        Extract Cmd$, " ", 3, Strt, SLen '-- extract third parm
        Beg = VAL(MID$(Cmd$, Strt, SLen))

        Extract Cmd$, " ", 4, Strt, SLen '-- extract fourth parm
        Fin = VAL(MID$(Cmd$, Strt, SLen))
    ELSE
        PRINT
        PRINT "SANITIZE Program Error: Missing Parameters"
        PRINT
        PRINT
        PRINT "Required Parameters are:"
        PRINT
        PRINT "SANITIZE Config File Machine First Doc Last Doc"
        PRINT
        Chime 10
        PRINT "Press the SPACE BAR to exit:"
        i$ = INPUT$(1)
        END
    END IF

    OPEN ConfigFile$ FOR INPUT ACCESS READ SHARED AS #1
    INPUT #1, Fg, Bg, Brdr, LstDir$, DocDir$, NdxDir$, AbstrDir$, Lang$
    CLOSE #1

    '-- load the noise words, capitalizing the first letter of each
    i = 1
    OPEN LstDir$ + "NOISE.DAT" FOR INPUT ACCESS READ SHARED AS #1
    DO UNTIL EOF(1)
        REDIM PRESERVE Noise$(1 TO i)
        INPUT #1, Noise$(i)
        n$ = LEFT$(Noise$(i), LEN(Noise$(i)) - 1)
        Noise$(i) = UCASE$(LEFT$(n$, 1)) + MID$(n$, 2)
        i = i + 1
    LOOP
    CLOSE #1

    IF NOT EmsLoaded% THEN
        Chime 8
        PRINT "The EMS has not been loaded."
        STOP
    END IF

    Sixteen& = 16 * 1024
    SixtyFour = 64
    ThirtyTwo = 32
    ThirtyTwoK& = ThirtyTwo * 1024&
END SUB

SUB EmsAlloc (NumPages%, Handle%, LoadFILE$) STATIC
    EmsAllocMem NumPages%, Handle%
    IF EmsError% THEN
        PRINT "Couldn't allocate"; CLNG(NumPages) * Sixteen&; "bytes of EMS for "; LoadFILE$
        Chime 2
        DO: LOOP UNTIL LEN(INKEY$) = 0
        q$ = INPUT$(1)
        END
    END IF
END SUB

SUB FindCombKey (WordPhrase$(), KeyEMS%, CombFlag) STATIC
CombFlag = FALSE

DIM KeyTemp AS CombKeyType '-- entire Combined Keyword
LENKey = LEN(KeyTemp)
Slash$ = "/"
ASCslash = ASC("/")

'-- if it was a valid range, then check words in range
IF FirstLast%(WordPhrase$(1), First, Last, Comb) THEN
FOR j = Last TO First STEP -1
'-- get word from Combined Keyword List (COMBKEY.STR)
EmsGet1El KeyTemp, LENKey, j, KeyEMS%

'-- convert it to a variable-length string for speed
KeyTempStr$ = RTRIM$(KeyTemp.Str)
Words = InCount(KeyTempStr$, " ") + 1 ' count number of words

IF Words <= 4 THEN
CALL Extract(KeyTempStr$, " ", 1, Strt, SLen) 'extract first word
CurrKey$ = MID$(KeyTempStr$, Strt, SLen) 'of combined keyword

'-- a trailing "/" on a key word asks for an exact match
IF MidChar%(CurrKey$, SLen) = ASCslash THEN
Exact = TRUE
CurrKey$ = LEFT$(CurrKey$, SLen - 1)
SLen = SLen - 1
ELSE
Exact = FALSE
END IF

'compare first word of combined Key [CurrKey$]
'against the current document word [WordTempStr$]
IF Lang$ = "GERMAN" THEN
IF NOT Exact THEN
Match = (CurrKey$ = LEFT$(WordTempStr$, SLen))
ELSE ' check for "exact" match
Match = (CurrKey$ = WordTempStr$)
END IF
ELSE
IF NOT Exact THEN
Match = (LCASE$(CurrKey$) = LEFT$(WordPhrase$(1), SLen))
ELSE ' check for "exact" match
Match = (LCASE$(CurrKey$) = WordPhrase$(1))
END IF
END IF

' no match, skip to next combined key in the First-Last range
IF NOT Match GOTO SkipCombKey

' continue matching the rest of the words in the combined key
' exiting out as soon as there's a non-match
AtFlag = FALSE
NotFlag = FALSE

FOR k = 2 TO Words ' number of words left in combined key
' extract the next word from the current combined Keyword (j)
CALL Extract(KeyTempStr$, " ", k, Strt, SLen)
CurrKey$ = MID$(KeyTempStr$, Strt, SLen)

IF MidChar%(CurrKey$, SLen) = ASCslash THEN
Exact = TRUE
CurrKey$ = LEFT$(CurrKey$, SLen - 1)
SLen = SLen - 1
ELSE
Exact = FALSE
END IF

IF AtFlag = FALSE AND NotFlag = FALSE THEN
DocWord$ = WordPhrase$(k)
ELSE
IF AtFlag = FALSE AND NotFlag = TRUE THEN
DocWord$ = WordPhrase$(k + 1)
ELSE
IF AtFlag = TRUE AND NotFlag = FALSE THEN
DocWord$ = WordPhrase$(k - 1)
ELSE
DocWord$ = WordPhrase$(k)
END IF
END IF
END IF

IF English THEN Lower DocWord$

IF ASCII%(CurrKey$) <> ASC("@") THEN
IF Lang$ = "GERMAN" THEN
'-- German: no need to use LCASE$
IF Exact THEN ' check for "exact" match
Match = (CurrKey$ = DocWord$)
ELSE ' wildcard match, only compare # of chars in CurrKey$
Match = (CurrKey$ = LEFT$(DocWord$, SLen))
END IF
ELSE
IF Exact THEN ' check for "exact" match
Match = (LCASE$(CurrKey$) = DocWord$)
ELSE ' wildcard match, only compare # of chars in CurrKey$
Match = (LCASE$(CurrKey$) = LEFT$(DocWord$, SLen))
END IF
END IF

ELSE ' special processing for @ wildcard
IF INSTR(AtList$, "/" + DocWord$ + "/") THEN
Match = TRUE ' the word was in the @ list, so continue
ELSE
IF English THEN
Match = FALSE
ELSE
Match = TRUE
AtFlag = TRUE
END IF
END IF

IF Match THEN
DocUerdS - VordPhrascStt ♦ 1) 
END 1F IF CtocWon3S " <* OocUordS = "te- OR DocuordS a "nichf THEN Mot Flag a TRUE 

END IF 

IF NOT Match GOTO SkipCombKey
NEXT ' word in current combined keyword

IF Match THEN ' this is a combined keyword, so add it to the list
CombFlag = TRUE
NextStart = LEN(KeyTempStr$)
EXIT FOR
END IF

END IF

SkipCombKey:
NEXT

END IF ' Table range was valid
END SUB
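The matching rule used in FindCombKey can be sketched outside BASIC. The following is a minimal Python illustration, not the listing's code: it assumes only what the comments above state, namely that a key word ending in "/" must match the document word exactly and that any other key word matches as a prefix. The function names `key_matches` and `matches_phrase` are hypothetical, and the "@" wildcard and negation handling are deliberately left out.

```python
def key_matches(key, word):
    """Match one key word against one document word.

    A trailing "/" on the key requests an exact match; otherwise the key
    is treated as a prefix (the listing calls this a wildcard match).
    """
    if key.endswith("/"):
        return key[:-1] == word
    return word.startswith(key)


def matches_phrase(combined_key, doc_words):
    """Return True if every word of a combined key matches the document
    words at the same positions (simplified: no '@' list, no negation)."""
    parts = combined_key.split()
    if len(parts) > len(doc_words):
        return False
    return all(key_matches(k.lower(), w.lower())
               for k, w in zip(parts, doc_words))
```

Under these assumptions, `matches_phrase("income tax/", ["income", "tax", "return"])` succeeds, while the same key fails against `["income", "taxes"]` because the exact-match marker rejects the longer word.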

SUB FindSingKey (Word$, KeyEMS%, NewSentFlagFS, SingFlag, Limit) STATIC
Count = 0

IF LEN(Word$) < Limit THEN EXIT SUB 'it can't satisfy us
DIM KeyTemp AS SingKeyType '-- Single Keyword to be compared
LENKey = LEN(KeyTemp)
ASCslash = ASC("/")
SingFlag = FALSE

IF RIGHTSlFirstUordS. 2) a THEN FlrstUordS a LEFTS<Fi>stUordS LENtFintunrrict - 3 , 
IF LENC UordS) < 3 THEN UortfS a word* ♦ STRIN5S(3 - lEN(Uordi) ASC("/^) 23 
IF Langs « "GERMAN" THEN Lower UordS ' " 

• check if the first 3 letters of the word return 

• a valid range froa the 3-dinensional table array 

IF FirstLast%(LCASE$(Word$), First, Last, Sing) THEN ' yes, so search thru range

FOR j = Last TO First STEP -1
'-- get the word from the SINGKEY.STR list
EmsGet1El KeyTemp, LENKey, j, KeyEMS%
CurrKey$ = RTRIM$(KeyTemp.Str)
SLen = LEN(CurrKey$)

'-- compare the single keyword [CurrKey$/KeyTemp.Str]
'   against the document word [Word$]
'= [replaced] IF RIGHT$(CurrKey$, 1) = "/" THEN
IF MidChar%(CurrKey$, SLen) = ASCslash THEN
CurrKey$ = LEFT$(CurrKey$, SLen - 1)
Match = (CurrKey$ = Word$)
ELSE
Match = (CurrKey$ = LEFT$(Word$, SLen))
END IF

IF Match THEN
IF LEN(CurrKey$) >= Limit THEN
SingFlag = TRUE
EXIT FOR
END IF
END IF

NEXT ' key in range

IF. NewSentFlagFS THEN 

'iTi^t .^[^jr^ ""*• ™" " *•» « «• *• 

Word$ = LCASE$(Word$)

END IF
Count = Count + 1

END IF

LOOP UNTIL Count > 1 OR SingFlag
END IF ' the range was valid



DEFSNG A-Z

FUNCTION FirstLast% (Word$, First%, Last%, KeyType%) STATIC
    '-- returns the starting (First) and ending (Last) range for the word
    '   by looking it up in the Table%() array
    First% = 0
    a0 = ASCII%(Word$)

    IF a0 > 37 AND a0 < 123 THEN
        a1 = MidChar%(Word$, 2)
        IF a1 > 37 AND a1 < 123 THEN
            a2 = MidChar%(Word$, 3)
            IF a2 > 37 AND a2 < 123 THEN
                a = XLateTable%(a0)
                b = XLateTable%(a1)
                c = XLateTable%(a2)

                IF a = 0 OR b = 0 OR c = 0 THEN FirstLast% = 0: EXIT FUNCTION

                IF KeyType% = Sing THEN
                    First% = SingTable%(a, b, c, 1)
                    Last% = SingTable%(a, b, c, 2)
                ELSE
                    First% = CombTable%(a, b, c, 1)
                    Last% = CombTable%(a, b, c, 2)
                END IF
            END IF
        END IF
    END IF

    '-- Return FALSE if there was no valid range (i.e., First% = 0)
    FirstLast% = (First% <> 0)
END FUNCTION
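The three-character lookup above can be illustrated with a short sketch. This is a Python approximation under stated assumptions, not the patent's implementation: it uses a dictionary keyed by three translated codes instead of the listing's four-dimensional integer arrays, takes the character mapping ('/' = 1, '&' = 2, a..z = 3..28) from the comments in the loader routine, and pads short words with "/" as FindSingKey does. The names `translate`, `build_index`, and `first_last` are hypothetical.

```python
def translate(ch):
    """Map a character to its table coordinate; 0 means 'not indexable'."""
    if ch == "/":
        return 1
    if ch == "&":
        return 2
    if "a" <= ch <= "z":
        return ord(ch) - 94  # so that a=3, b=4, ..., z=28
    return 0


def prefix_key(word):
    """Translate the first three characters, padding short words with '/'."""
    padded = (word.lower() + "//")[:3]
    return tuple(translate(ch) for ch in padded)


def build_index(sorted_keywords):
    """Return {(a, b, c): (first, last)} holding 1-based positions of the
    first and last keyword sharing each three-character prefix."""
    index = {}
    for pos, word in enumerate(sorted_keywords, start=1):
        key = prefix_key(word)
        if 0 in key:
            continue  # prefix contains a character outside the table
        first, _ = index.get(key, (pos, pos))
        index[key] = (first, pos)
    return index


def first_last(index, word):
    """Return (first, last) for the word's prefix, or None if no range."""
    key = prefix_key(word)
    if 0 in key:
        return None
    return index.get(key)
```

The point of the design is that a document word only has to be compared against the keywords between `first` and `last`, rather than against the whole keyword list.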

DEFINT A-Z

SUB LoadData STATIC

'-- Read in Combined Keys
LoadFILE$ = LstDir$ + "COMBKEY.STR"
IF NOT Exist%(LoadFILE$) THEN CLS : PRINT LoadFILE$; " not found.": END
DIM CombKeyTemp AS STR64
NumCombKeyword = FileSize&(LoadFILE$) \ SixtyFour
CombKeyWordEMS% = LoadIntoEMS%(LoadFILE$)

'-- Read Single Keys
LoadFILE$ = LstDir$ + "SINGKEY.STR"
IF NOT Exist%(LoadFILE$) THEN CLS : PRINT LoadFILE$; " not found."
DIM SingKeyTemp AS STR32
NumKeyword = FileSize&(LoadFILE$) \ ThirtyTwo
SingKeywordEMS% = LoadIntoEMS%(LoadFILE$)

'-- Read 3-char Tables
Symb = 28: First = 1: Last = 2
REDIM XLateTable%(38 TO 122)
REDIM SingTable%(1 TO Symb, 1 TO Symb, 1 TO Symb, 1 TO 2)
REDIM CombTable%(1 TO Symb, 1 TO Symb, 1 TO Symb, 1 TO 2)

XLateTable(47) = 1 ' / char, as used in non-wildcard words
XLateTable(38) = 2 ' & char, as used in S&P, A&P, etc.

ASCa = ASC("a")
ASCz = ASC("z")
FOR i = ASCa TO ASCz
XLateTable(i) = i - 94 ' so that a=3, b=4,...,z=28
NEXT

FSetAH LstDirS + ■KETUCTO.TBL", SE6 SingTabteXd, 1, 1, 1), (A * 28), (28 • 28) 
FGetAH LStDlrS ♦ "KETCOIB . TBL* , SEG COfifaTableXd , 1, 1, 1), (A * 28), (28 * 28) 

END SUB

FUNCTION LoadIntoEMS% (File$) STATIC
    '-- Returns the handle where the file was loaded into
    EMSPg = EmsGetPFSeg%
    SizeofFile& = FileSize&(File$)
    NumPages = SizeofFile& \ Sixteen& + 2 ' round off to nearest 2 pages
    EmsAlloc NumPages, FileEMS, File$

    Num32kBlocks = SizeofFile& \ ThirtyTwoK&
    Leftover& = SizeofFile& - (Num32kBlocks * ThirtyTwoK&)
    FOpenAll File$, 0, 4, LoadFILE
    FOR i = 1 TO Num32kBlocks + 1
        Box0 14, 10, 18, 70, 2, -1
        PaintBox0 14, 10, 18, 70, -1
        QPrintRC "Loading " + File$ + " block" + STR$(i) + " /" + STR$(Num32kBlocks - 1) + " ", 16, 12, -1

        '-- map pages of the EMS memory to the upper memory page frame
        FOR j = 1 TO 2
            EmsMapMem FileEMS, j, (i - 1) * 2 + j
            IF EmsError% THEN PRINT "Ems error:"; EmsError%: STOP
        NEXT

        '-- seek to beginning of current block
        FSeek LoadFILE, (i - 1) * ThirtyTwoK&
        IF DOSError% THEN PRINT "Dos Error:"; WhichError%: STOP

        IF i < Num32kBlocks + 1 THEN
            '-- get the 32k block and put it directly into the EMS page frame
            FGetA LoadFILE, BYVAL EMSPg, BYVAL 0, ThirtyTwoK&
            IF DOSError% THEN PRINT "Dos Error:"; ErrorMsg$(WhichError%): STOP
        ELSE
            '-- load the left over (<32k) bytes
            FGetA LoadFILE, BYVAL EMSPg, BYVAL 0, Leftover&
            IF DOSError% THEN PRINT "Dos Error:"; ErrorMsg$(WhichError%): STOP
        END IF
    NEXT

    FClose LoadFILE
    ClearScr0 14, 10, 18, 70, NormAttr
    LoadIntoEMS% = FileEMS
END FUNCTION

SUB Wrapping (Sent$, Text$(), NumLines) STATIC
    Wid = 78
    MaxNumLines = LEN(Sent$) \ Wid + 3
    REDIM Text$(1 TO MaxNumLines)
    NumLines = 0
    DO
        '-- increment NumLines counter for number of lines of text
        NumLines = NumLines + 1
        IF NumLines > MaxNumLines THEN
            REDIM PRESERVE Text$(1 TO NumLines)
        END IF

        '-- look for the last marker so we can word wrap at that point
        NewSent = QInstrB(Wid + 1, Sent$, CHR$(1))
        IF NewSent > 2 THEN
            IF RIGHT$(RTRIM$(Sent$), 1) = CHR$(1) THEN
                Text$(NumLines) = RTRIM$(LEFT$(Sent$, NewSent))
            ELSE
                Text$(NumLines) = RTRIM$(LEFT$(Sent$, NewSent - 1))
            END IF
            Sent$ = MID$(Sent$, NewSent + 1)
        ELSE
            LastSpc = QInstrB(Wid + 1, Sent$, " ")
            Text$(NumLines) = RTRIM$(LEFT$(Sent$, LastSpc))
            '-- remove portion of string that's been moved to the Text$() array
            Sent$ = MID$(Sent$, LastSpc + 1)
        END IF
    LOOP WHILE LEN(Sent$) > Wid

    Sent$ = RTRIM$(Sent$)
    IF LEN(Sent$) THEN
        NumLines = NumLines + 1
        IF NumLines > MaxNumLines THEN
            REDIM PRESERVE Text$(1 TO NumLines)
        END IF
        Text$(NumLines) = Sent$
    END IF
END SUB
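The wrapping strategy above (search backwards from column Wid + 1 for a break point, emit the text before it, continue with the rest) can be sketched as follows. This is a simplified Python illustration with a hypothetical function name `wrap`; it breaks only on spaces and omits the listing's separate CHR$(1) sentence marker. The hard break for a run with no space is an assumption added here, since the listing does not show that case.

```python
def wrap(sentence, width=78):
    """Split text into lines of at most `width` characters, breaking at
    the last space found in the first width + 1 characters."""
    lines = []
    rest = sentence
    while len(rest) > width:
        # rfind scans indices 0..width, i.e. the first width + 1 characters
        cut = rest.rfind(" ", 0, width + 1)
        if cut <= 0:
            # assumption: no usable space, so break the word at the width
            lines.append(rest[:width])
            rest = rest[width:]
        else:
            lines.append(rest[:cut].rstrip())
            rest = rest[cut + 1:]
    rest = rest.rstrip()
    if rest:
        lines.append(rest)
    return lines
```

For single-spaced text, joining the returned lines with a space reproduces the stripped input, which is a convenient check when adapting the routine.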

DEFINT A-Z

'$TITLE: 'Q-Search User Program'
'$SUBTITLE: 'QSEARCH Module'

CONST FALSE = 0, TRUE = NOT FALSE, ASCEND = 0, DESCEND = 1
CONST MaxShow = 50

'scan code + 200 so as not to mix with letters

CONST UP = 272, PgUp = 273, Dn = 280, PgDn = 281, HM = 271, EN = 279
CONST CtrlPgUp = 332, CtrlPgDn = 318, CtrlHM = 319, CtrlEN = 317
CONST F1 = 259, F2 = 260, F3 = 261, F4 = 262, F5 = 263
CONST F6 = 264, F7 = 265, F8 = 266, F9 = 267, F10 = 268

CONST ESC = 27, CR = 13

CONST NewSearch = 1, AddWords = 2, EditSearch = 3, Back = 4, Forward = 5, SWAPS = 6, EX = 7

'$ INCLUDE: '\\VADIH\C-ORIVE\ttSer\DICLUDE\TTPES.BI' 

'SINCLUDE; t \\VADIH\C-ORIVE\user\includeVdefcnf.bi ' 

•SINCLUDE: 'WVADinXC-DRIVEVuserMNCLUDEXQ&ilType.BI 1 

• SINCLUDE: ' \\VADIH\C-ORIV£\user\INCLU0E\shared.ai 1 



•SINCLUDE: 1 \\VAD IN\C-ORIVE\ttser\EXTERN . BAS * 



• External Declarations 



DECLARE SUB InitNea (Sega!, AddrX, NumByxesX, Value!) 
DECLARE SU8 CEdit (ArraySC), xS, Action!, Ed AS Edit Info) 
DECLARE SUB Find Exact CSYVAL AddressX, NuaElZ, SearchlS) 

_ Internal Declarations 

DECLARE SU3 AddSearchTera (Expr AS ExpressionType) 
DECLARE SUa AddSentence (Expr AS ExpressionType) 
DECLARE SUB Bui IdCeabTable (RodeS) 
DECLARE SUa Chang eChar (T*tS, NewS, KeepS) 
DECLARE SU8 ClearBG () 

DECLARE SU3 Code2Str (StoreS, Location! , Code!) 

DECLARE SUB Config <) 

DECLARE SUB CPrint CxS) 

DECLARE SUB CreateTables () 

DECLARE SUB DeleteWord CExor AS ExpressionType) 

DECLARE SUB Cut Word (KWcrdS) 

DECLARE SUB DlspHsg CrtsgS. RZ, cU 

DECLARE SUB DIspOSL (LibNaseS, ScrnNaseSl 

DECLARE SU8 DrawBox (ULScvl, UlCoU, LB Row!, LRColZ, Frame!, Col!) 
DECLARE SUB EnsAlloc CitusPages!, Handle!, LoadFtleS) 

DECLARE SUB FlndCosttCey C-crdListSO, KuaWordZ, ttobFeundSO, NuaCoebfound!) 
DECLARE SUB FindSingfey (VordListSC), NuaiterdX, SingFoundSC), NueSingFound!) 
DECLARE SUB Freehand les () 

DECLARE SUB FullText <Fir*t8, Last&, FileNuafiC, ExprS) 
DEC. RE SUB HistHessage <} 

DECLARE SUB InsertWord (Exsr AS ExpressionType) 

DECLARE SUB Lib2Scm (KeaelnLibS, ScrnLibtt), RonoCodeX, Attribute!. ErrorCode!) 
DECLARE SUB LoadData ()

DECLARE sua LeadPrefiie, (Pref IxesSO. Runrnf <<^i» i»~*«> 
MOM SUB oth.rUonb CE«r « Expr^l^, i^)^ 

£3£ ^ jess "m^j^st **• ^ » ^~<«^, 

• SS Sb ~ E « 

DECLARE SUB ReadGer«nTe*T (TxtS) 
DECLARE SUB Reference* (TextS) 
DECLARE SUB ReUHteK.st (CurpJQ ' 
DECLARE SUB Sections (TxtS) 

sss s skk as—- «— » 

DECLARE SUB ScrSR <SR$) rvunumj 
DECLARE SUB ShouAbitf (RecNuag) 

DECLARE £ SSS^Mr-" ^ » 
DECLARE SUB Shortuery (> 

DECLARE SUB Teranateh (TeraS, Exnr AS Exnre^ionTWf . c r . , 

DECLARE SUB VertNenu CItSc ,^^f^!T^S B S ) ^ll? 7 *"; J"**' ExactFlag) 

DECLARE SUB WattSpace « *** LCT *' ™ « Conf< 5 , Nodes, KK, EJC, PgUK, PgOK, ux, OX, TeraTypenOde 

DECLARE FUNCTION OictS CCodeZ) ' Wy <>) 

OECLARE FUNCTION DictSrchX (u AS DletType) 

DECLARE FUNCTION FirjcUitZ <UortS, First*, Last*, KeyTypeX) 

DECLARE FUNCTION Keylnstrt (KeyStrS, SrchJ) ^ 

DECLARE FUNCTION Keyrties (KeyStrS, StartZ) 

OECLARE FUNCTION Question! (Prompts, Choices ub«is) 

DECLARE FUNCTION NuaS ' 

DECLARE FUNCTION Spacers} <x2, SpaceZ) . . 

OECLARE FUNCTION Str2CoceX <SS, tt) 
DECLARE FUNCTION ZerofesS CxT.. Zerott 



ON ERROR GOTO Again

STACK 6000

Restart:

REDIM Menu$(1)
REDIM GermanMenu$(1)
LoadedFlag = FALSE
CALL Config

DIM Expression AS ExpressionType
Excludes = - 



DATA "Nenu SeLecttons:* 
DATA "Related Topics I T' 

. DATA -Add search words a" 
DATA -Delete a Word . ! D" 

OATA "Restore Deleted «crd i R" 

PATA " . _» 

OATA "View Oocuaents ; V" 

DATA " !— - 

DATA "Hi 
DATA "— 

DATA -Exit Prograa • j X* 

DATA "END" 

NenuGermanData: 
DATA "NENUE:" 

DATA "Verwandte Theaen j T" 

DATA "Fuege Wort Hiruu j F" ' 

DATA "Loesehe Wort L M 

DATA "Hoi wortmweek H" — 

OATA " _» 

OATA "Swche Funcstelle- { S" 
OATA * 

OATA -Heue Sucheingabe I N" 

OATA L- - 

DATA "Exit System J x" 
OATA "END" 

MjBOota: 

DATA "or- # -ars«,-Kess^l^-sen^-r€p■,-03^•dr^-drs- .* 
Section Data: 

OATA "Section", "SecVSec." 
Artieleteta: 

OATA "Article-, "Art.", 'An" 
ParagraphData: 

OATA "Paragraph", "Par". "Per." 
ArtikelData: 

DATA ■Artikal u ,"Art",-i-t." 
MuaData: 



EngUshStrS a "tadrvm" 
GeroanStrS a "TFLHSNX" 
IF Lang$ = "GERMAN" THEN
HotStr$ = GermanStr$
ELSE
HotStr$ = EnglishStr$
END IF
'$Page



'-- START OF PROGRAM FLOW

NewSearch:

ClearBG
QPrintRC "| Q-Search |", 1, 36, RevAttr%
GlobalStatus = NewSearch

Expression.Num = 0 ' Number of subexpressions entered (none yet!)
Expression.Match = -1 ' No full search yet
Expression.SubExpr(1).Match = -1 ' No search yet
Expression.SubExpr(1).Num = 0 ' No matches either
Expression.SubExpr(1).Phrase = ""

CurrentSub = 1 ' current subexpr (1 to 3) that's being 'edited'
MaxWordLen = 0 ' reset max word length global variable for ShowExpr
WordPtr = 0 ' clear the WordPtr for FindRelatives()
UFOS « 

OriginalExpr$ = ""

'-- get the sentence input and place words directly into Expression

AddSentence Expression
ShowExpr Expression



nenu: 

If Langs <> -German- then 

CALL References ("PRESS LETTER OR USE ■ ♦ CHRSC24) ♦ CMRS(25) ♦ - TO HIGHLIGHT CHOICE AND THEN PRESS ENTER") 

ELSE 

CALL References(-SKOUL0 BE MESSAGE IN GERMAN") 

END IF 

Action - 1 • set initial action to just display menu 
KyS = "Search NENU" 
Choice * 0 

LOCATE 2, 51 

IF LangS 3 "GERMAN* THEN 

ELSg VertHenu GeraanMenuSO, Choice, LEN(GeroanHenu$<1)) # 18, KyS. Action, Cnf, 0, 0, 0, 0, 0, 0. Ter»Ty,»ef»odeS, TeraTypeFlag 
EHD IF K6nuSO ' Ch0ic# ' L£N(HenuS(1)), 18, KyS, Action, Cnf, IT, 0, 0, 0, 0, 0, 0, .TeraTypeHodeS, TeraTypeFlag 

* — r Action is automatically set to 3 for polling after call to VertHenu 

Choice = 1 
DO 

IF LangS = "GERMAN - THEN 

ELSE VertftenU 6erean,,em,SO > a* 01 "- LEN(RenuS(1)), 18, KyS, Action. Cnf. "H", 0, 0, 0. 0, 0, 0, TeraTypeHodeS, TeraTypeFla 
VertHenu HenuSO; Choice, LEN(HenuS(1)) # 18, KyS, Action, Cnf, "H\ 3. 0, 0, 0, 0, 0, TeraTypeHodeS, TeroTypeFlag 

END IF 

IF Action = 3 THEN 'user exited without making a choice or Escaping

IF LEN(Ky$) = 1 THEN ' check for direct access (single letter) Choice
IF Choice THEN EXIT DO
LETTER = INSTR(MenuChoice$, Ky$)
IF LETTER THEN
Choice = LETTER
EXIT DO
END IF

IF LEN(Ky$) = 1 THEN
AscKy = ASC(Ky$)
ELSEIF LEN(Ky$) = 2 THEN
AscKy = ASC(RIGHT$(Ky$, 1)) + 200
END IF

SELECT CASE AscKy 



CASE F10, NewSearchKey ' new search
IF Lang$ = "GERMAN" THEN
i$ = Question$("NEUE SUCHE ? [J/N]", "NJ", "ACHTUNG !")
IF i$ = "J" THEN
GOTO NewSearch
END IF
ELSE
i$ = Question$("New Search? [Y/N]", "NY", "WARNING!")
IF i$ = "Y" THEN
GOTO NewSearch
END IF
END IF
CASE F1 ' help

CALL toitspace 80rry ' ***** ** wrutKly no -elp available. Press the Space Bar to Return to H 
OU DispHag("" # 0, 0) " " 

CASE F2, ShowExprKey 'Show query
CALL ShowQuery

CASE ELSE
END SELECT
END IF

ELSEIF Action = 4 THEN ' user exited with a choice, but check for ESCape

IF Ky$ = CHR$(ESC) THEN
GOTO ExitSearch
ELSE
EXIT DO
END IF



ELSE 



LOCATE 1, 25
Chime 2
PRINT " -- Invalid ACTION value "; Action; " returned -- "
FreeHandles
WaitSpace
END



Action = 5' restore screen 

VertHenu MenuSO, Choice, LEHOtenuSO)), 18, KyS, Action, Cnf, -n", 0, 0, 0, 0, 0, 0. TeraTypenodeS, TaraTypeFlao 
LOCATE , , 0 

ScrSR "S" 

IF Choice = 0 OR (Expression.Num = 0 AND Choice <> 8 AND Choice <> 10) GOTO Menu
CALL SelectMenu(MID$(MenuChoice$, Choice, 1), Expression, HotStr$, GlobalStatus)

IF GlobalStatus <> EX THEN 'not exit
ScrSR "R"
ShowExpr Expression
IF GlobalStatus = NewSearch GOTO NewSearch
GOTO Menu



ELSE 'exit
ExitSearch:

IF Lang$ = "GERMAN" THEN
i$ = Question$("Exit Program? [J/N]", "NJ", "ACHTUNG!")
ELSE
i$ = Question$("Exit Program? [Y/N]", "NY", "WARNING!")
END IF

IF i$ = "N" THEN
ShowExpr Expression
GOTO Menu
END IF
ExitProgram:
CLS
FreeHandles
END

DATA -Copyright 1990, 1991 by Ted n. Young. All Rights Reserved.- 

END IF 
'SPege 
Again:

NormAttr = OneColor%(Fg, BG)
'PRINT ERR
CALL ClearBG

DispMsg "Sorry, cannot continue with this action. Please try another.", 0, 0
DO: LOOP UNTIL LEN(INKEY$)
IF LoadedFlag THEN
RESUME NewSearch
END IF



SUB AddSentence (Expr AS ExpressionType) STATIC

'-- lets the user type in a full sentence, parses the sentence and
'   places the keywords found into the current expression

DIM Ed AS EditInfo
NumExprWords = 0

GetSentence:
Ed.Rows = 3
Ed.Wide = 76
Ed.Wrap = Ed.Wide
Ed.AColor = NormAttr
Ed.Frame = 0
Ed.CurCol = 1: Ed.LC = 1
Ed.CurLine = 1: Ed.TL = 1



RED IN Sentences CI TO 3) up to 3 lines of text 
UlRow = S: UlCol » 2 

LRRow s uLRow + Ed. Rows ♦ 1: LRCol = UlCol ♦ Ed. wide ♦ 1 
ScrSR 

CleerScrO UlRow. UlCol, LRRow, LRCol. NoraAttr 

CALL OrevBoxttHRow, UlCol, LRRow, LRCol, 1, NoraAttr) 

IF TersTypef lag THEN 

U1 CherS - "A" 

RNCherS » "P 1 

ELSE 

LNCharS » "C" 
RNCharS » "3" 

END IF 

IF Lang$ = "GERMAN" THEN
Title$ = "Stellen Sie Ihre Frage"
ELSE
Title$ = "Enter Your Search Request"
END IF

CALL QPrintRC(LMChar$ + Title$ + RMChar$, ULRow, (ULCol + LRCol - LEN(Title$) - 2) \ 2, NormAttr)

IF Lang$ = "GERMAN" THEN
Title$ = "danach ENTER Taste"
ELSE
Title$ = "Press ENTER when done"
END IF

CALL QPrintRC(LMChar$ + Title$ + RMChar$, LRRow, (ULCol + LRCol - LEN(Title$) - 2) \ 2, NormAttr)
LOCATE ULRow + 1, ULCol + 1
x$ = ""

QEdit Sentence$(), x$, 0, Ed
IF ASC(x$) = ESC THEN EXIT SUB
ScrSR "R"

'-- put the sentence into a single string
SearchReq$ = ""
FOR i = 1 TO 3
IF LEN(Sentence$(i)) THEN SearchReq$ = SearchReq$ + " " + Sentence$(i)
NEXT
SearchReq$ = LTRIM$(SearchReq$)

IF LEN(SearchReq$) = 0 GOTO GetSentence

IF Lang$ = "GERMAN" THEN
DispMsg "Computer liest Sucheingabe", R, c
ELSE
DispMsg "Parsing sentence..", R, c
END IF

IF Lang$ = "ENGLISH" THEN
ReadEnglishText SearchReq$
ELSEIF Lang$ = "GERMAN" THEN
ReadGermanText SearchReq$
END IF

REDIM WordList$(1 TO 1)
WordParse SearchReq$, WordList$(), NumWords

REDIM Comb$(1 TO 1)
FindCombKey WordList$(), NumWords, Comb$(), NumComb
REDIM Sing$(1 TO 1)
FindSingKey WordList$(), NumWords, Sing$(), NumSing

'-- Now use DictSrch% to convert the synonyms found to their Code Numbers
DIM DictTemp AS DictType
REDIM ExprCodes(1 TO 1) AS CodePolyType

FOR i = 1 TO NumComb
DictTemp.Str = Comb$(i)
GOSUB AddWord
NEXT

FOR i = 1 TO NumSing
DictTemp.Str = Sing$(i)
GOSUB AddWord
NEXT

IF NumSing + NumComb = 0 OR Expr.Num = 0 THEN
DispMsg "", 0, 0
IF NumSing + NumComb = 0 THEN
IF Lang$ = "GERMAN" THEN
DispMsg "Sucheingabe enthaelt kein bekanntes Wort. LEERTASTE um weiterzumachen!", R, c
ELSE
DispMsg "There were no keywords found in your sentence. Press the Space Bar now to continue.", R, c
END IF
ELSE
IF Lang$ = "GERMAN" THEN
DispMsg "Kein Dokument enthaelt diesen Begriff.", R, c
ELSE
DispMsg "That word/phrase is not an indexed word in this database. Press the Space Bar now to continue.", R, c
END IF
END IF
WaitSpace
DispMsg "", 0, 0
GOTO GetSentence
END IF

SortT ExprCodes(1), NumExprWords, DESCEND, LEN(ExprCodes(1)), 2, -3
Expr.Num = 1
Expr.SubExpr(1).Num = NumExprWords

'-- store each code as a 2-byte integer in the phrase string
FOR k = 1 TO NumExprWords
Code2Str Expr.SubExpr(1).Phrase, k, ExprCodes(k).Code
OriginalExpr$ = OriginalExpr$ + MKI$(ExprCodes(k).Code)
NEXT

DispMsg "", 0, 0
EXIT SUB



LETTERS = LEFTWDictTeap.Str, 2) 
IF LETTER$ = "zs" OR LETTER$ = "za" THEN

SecNua = VALtHlM(OictTwnp.Stf , 3)> 
IF SecNua > 0 AND SecNua « 3000 THEM 
IF UTTERS = THEN 

Code = SecNua * SecCode 

ELSE 

IF SecNua <= 30 THEM Code = SecNua + Art Code 

EM) IF 

END IF 

ELSE 

Code * DlcrSrchS(OlctT«Bp) 

END IF 

IF Code THEN then it's a valid code 

Found a FALSE 

FOR j e 1 TO NuaExprWords 

IF Code 3 Expr Codes (j). Code THEN this code ws already entered 



EXIT FOR 

END IF 

NEXT 

IF NOT Found THEN add it to the expression 

IF NunExprUords * 15 AND KYIndx (Code). Nub > 0 THEN 
.Expr.Nua = 1 

NuaExprWords = NunExprUords * 1 

RED IN PRESERVE ExprCodesO TO Kua£xprUords> AS CodePolyTypt 
Exp rCodes(NUDExprUortb). Code = Code 

EasfietlEl Expr Codes (NtaExprUords). Poly, LEN(ExprCodesll).Poty). Code, PolySeayEHS 

END IF 



ELSE it»* not a valid cods, so indicate an error 

Chiae 2 

DispHsg "The word " ♦ RTSIHSCDictTeap.Str) ♦ - was not found in the dictionary. R, c 
UaitSpace 

Dlspftsg 0. 0 m 



END IF 
RETURN 
END SU8 

SUB ChangeChar (TxtS, NewS, KeepS) 



'can be deleted after change abstract program 

Start « 1 

DO 

ApostrLoc ■ 1NSTR< Start, T*tS, " iU ) 
IF ApostrLoc THEN 

Start = ApostrLoc * 1 

SpaceLOC = INSTS< Start, TxtS, " ") 

IF SpaceLoc > 0 THEN 

NIDSCTxtS, ApostrLoc, SpaceLoc - ApostrLoc ♦ 1) ■ STRINGS CSpaceLoe - Aoestrloe ♦> 1, 32) 
Start = ApostrLoc 

ELSE 

EXIT DO 

END IF 

END IF * 

LOOP UNTIL NOT ApostrLoc 

•the sane for-the f irst ■crd— — 

ApostrLoc = INSTR (Start, TxtS, "•■> 
IF ApostrLoc » 1 THEN 

SpaceLoc ■ INSTR (Start, TxtS, " •) 

IF SpaceLoc > 0 TnEM 

. HIDS(TxtS. ApostrLoc, SpaceLoc - ApostrLoc + 1) - STRINGS (SpaceLoc - XoostrLee ♦ 1, 32) 
Start » ApostrLoc 

END IF 

END IF • 

'delete • ' ' 

Start »1 

DO 

ApostrLoc - INSTRt Start, TxtS, ") 
IF ApostrLoc THEN 

Start « ApostrLoc ♦ 1 

HIOKTxtS, ApostrLoc, 1) a ■ • 

END IF 

LOOP UNTIL NOT ApostrLoc 



'-- replace all chars except those contained in Keep$
LENTxt = LEN(Txt$)
FOR i = 1 TO LENTxt
IF INSTR(Keep$, MID$(Txt$, i, 1)) = 0 THEN
MID$(Txt$, i, 1) = New$
END IF
NEXT

Start = InstrTbld, TxtS, KeepS) 
IF Start > 0 THEN 
MID$(Txt$, 1, Start - 1) = STRING$(Start - 1, 1)
DO 

SpaceLoc * IHSTRCStart, TxtS, - •) 
IF SpaceLoc THEN 

00 WHILE HidCharttxt*. SpaceLoc ♦ 1) = 32 

niDstms, SpaceLoc, i) = chj»(o> 
SpaceLoc = SpaceLoc ♦ 1 

LOOP 

Start » SpaceLoc 

ELSE 

EXIT DO 

END IF 

LOOP UNTIL SpaceUic » 0 
END IF ■ " 



SUB CleerSe STATIC 

STATIC Array!, BGZO '— FALSE If this is the first call, so we need to save 

» * the screen in an array after we fill the background 

• TRUE if we've already itored the bg in the array 

IF TeraTypeFlag THEN 
IF ArrayX THEN 



ELSE 



ScrrttesiO 1. 1, 25, 80, B6XC1) 



as 

Ruler1$ = STRING$(1, 178) + STRING$(78, 178) + STRING$(1, 178)
Ruler2$ = STRING$(1, 178) + STRING$(78, 176) + STRING$(1, 178)
Ruler3$ = STRING$(1, 178) + STRING$(78, 178) + STRING$(1, 178)

QPrintRC Ruler1$, 1, 1, NormAttr%
FOR i = 2 TO 24
QPrintRC Ruler2$, i, 1, NormAttr%
NEXT
QPrintRC Ruler3$, 25, 1, NormAttr%

REDIM BG%(1 TO 2000)
ScrnSave0 1, 1, 25, 80, BG%(1)
Array = TRUE

END IF 

ELSE 

CL5 

END IF 

END SUB 

SUB Code2Str (Store$, Location%, Code%) STATIC
    '-- store Code% as a 2-byte integer in slot Location% of Store$
    MID$(Store$, Location% * 2 - 1, 2) = MKI$(Code%)
END SUB
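Code2Str packs each keyword code into a two-byte slot of a string, so slot n occupies characters 2n - 1 and 2n. A minimal Python sketch of the same packing follows, assuming the little-endian signed 16-bit layout that MKI$ produces; the helper names `code_to_str` and `str_to_code` are hypothetical and a `bytearray` stands in for the BASIC string.

```python
import struct


def code_to_str(store, location, code):
    """Write `code` as a little-endian signed 16-bit integer into the
    1-based slot `location` of the bytearray `store`."""
    offset = location * 2 - 2  # BASIC position 2n - 1, converted to 0-based
    store[offset:offset + 2] = struct.pack("<h", code)


def str_to_code(store, location):
    """Read back the 16-bit code stored in the 1-based slot `location`."""
    offset = location * 2 - 2
    return struct.unpack_from("<h", store, offset)[0]
```

Fixed two-byte slots keep the phrase compact and let the k-th code be read back by position alone, without a separator.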

SUB Conf ig STATIC . 

SHARED NL6SO, LENNUSO. HenuSO, nenuChoiceS, GernanHerwSO,. GercanltenuChoiceS 
CIS 

IF COMMAND$ <> "" AND COMMAND$ <> "/B" THEN
IF INSTR(COMMAND$, " ") THEN
DBName$ = LEFT$(COMMAND$, INSTR(COMMAND$, " ") - 1)
ELSE

END IF

ELSE

PRINT "No Database Specified. Please type: USER [database name]"
Chime 10

OPEN DBName$ + ".CFG" FOR INPUT AS #1
INPUT #1, Fg, BG, Brdr, LstDir$, DocDir$, NdxDir$, AbstrDir$, Lang$, TermTypeMode$
CLOSE #1

Cnf JtonTyp = HonitorS 

IF D*STR(C0«NANDS, V5*> OR TeraTypettodeS «> "LOCAL" THEN Cnf.HonTyp « 2 

SELECT CASE Cnf.HonTyp 

CASE 3, 5, 7 'CGA, EGA/Color or VGA/Color eon iters 

Cnf .PwL3ar » AS 

Cnf .Ke-3oa = 49 

Cnf .ActivCh = 48 

Cnf.InActCh * 52*6 

Cnf .Hi^ite ■ 31*79 

Cnf.InAe^MiLt = 64 

Cnf .Norflen » 30 
• Cnf.CurSiie = 13 

IF Cnf.HonTyp n 5 THEN Cnf.CurSiie s 7 
CASE ELSE 

Cnf .PuLBor * 112 

Cnf .flendox « 112'2 

Cnf.ActivCh « 112*10 

Cnf .InActCh = 112* A 

Cirf.HlLfte = 15*112 

Cnf;lnACTHiLt a 60 

Cnf .Nonflen » 7 

Cnf.CurSiie • 13 
END SELECT



File$ = NdxDir$ + "AVGDOCFQ.DAT"
IF NOT Exist%(File$) THEN
CLS
PRINT File$; " not found."
CALL FreeHandles
END
END IF

OPEN File$ FOR INPUT AS #1
INPUT #1, RootOfAvgDocFreq!
CLOSE #1

File$ = NdxDir$ + "PolyAvg.Dat"
IF NOT Exist%(File$) THEN
CLS
PRINT File$; " was not found."
CALL FreeHandles
END
END IF

OPEN File$ FOR INPUT AS #1
INPUT #1, PolyAvg!
CLOSE #1

FUeS ■ LstDlrS ♦ "3" - LsngS + ".LST" 
IF NOT Exist%(File$) THEN
CLS
PRINT File$; " was not found."
CALL FreeHandles
END
END IF

OPEN File$ FOR INPUT AS #1
LINE INPUT #1, AtList$
CLOSE #1

"COLOR Fg, B6, Brdr 

• wwwwvwvwwwwwvw 
'which terminal t1ae 

IF TermTypeMode$ = "LOCAL" OR TermTypeMode$ = "PC" THEN
TermTypeFlag = TRUE
ELSE
TermTypeFlag = FALSE
END IF

IF TernTypef lag THEN 

Hist Char* - 0-35C178) 
HighCharS = CHMC219) 
LHargS = "-j " 
RhargS « " f ■ 



ELSE 



IF 



HistCharS • "X' 
HighCharS - — 
LHargS - "L" 
•3- 



NormAttr = OneColor%(Fg, BG)
IF Cnf.MonTyp = 2 THEN
    IF BG > 0 THEN
        RevAttr = OneColor%(14, 4)
    ELSE
        RevAttr = OneColor%(BG, Fg AND 7)
    END IF
ELSE
    IF BG > 0 THEN
        RevAttr = OneColor%(14, 4)
        HiAttr = OneColor%(0, 3)
    ELSE
        RevAttr = OneColor%(BG, Fg AND 7)
        HiAttr = RevAttr
    END IF
END IF
CLS

IF NOT EmsLoaded% THEN
    PRINT "The EMS Driver has not been loaded."
    Chime 2
    END
END IF



HomeKey = ASC("H")
EndKey = ASC("E")
PgUpKey = ASC("P")
PgDnKey = ASC("G")
RightArrowKey = ASC("R")
LeftArrowKey = ASC("L")
UpArrowKey = ASC("U")
DownArrowKey = ASC("D")
ShowExprKey = ASC("S")
DirNumKey = ASC("I")
NewSearchKey = ASC("N")

SixtyFour = 64
SixteenK = 16384
ThirtyTwoK& = 32768
Sing = 0: Comb = 1
AscA = ASC("a")
AscZ = ASC("z")
AscUpperA = ASC("A")
AscUpperZ = ASC("Z")
Asc0 = ASC("0")
Asc9 = ASC("9")
tosst! a 2! 
AbstrHode = 8 

New$ = " " ' replacement string for punctuation

FOR j = 65 TO 90: Keep$ = Keep$ + CHR$(j): NEXT
FOR j = 97 TO 122: Keep$ = Keep$ + CHR$(j): NEXT

KeepS * Keeps ♦ «"8" •♦-<£-• * CHRS<34) 

NdxLEN = LEN(Ndx)
FCInfoLEN = LEN(FCInfo)
KYInfoLEN = LEN(KYInfo)
WeightLEN = LEN(Weight)
PolyLEN = LEN(PolyValue)

CLS

' Display the opening title screen -- it is in a separate QSCR library
IF TermTypeFlag THEN
    DispQSL "USER.QSL", "OPENING"
ELSE
    DispQSL "USER.QSL", "OPENTERM"
END IF

CreateTables

IF Lang$ = "GERMAN" THEN
    QPrintRC " [ EINLADEN DER DATEN ] ", 24, 29, NormAttr
ELSE
    QPrintRC " [ LOADING DATA FILES ] ", 24, 29, NormAttr
END IF

LoadData
Chime 6

NumFilesLoaded = NumFilesLoaded + 1
IF Lang$ = "GERMAN" THEN
    QPrintRC " [ EINLADEN DER DATEN " + STR$(NumFilesLoaded) + " ] ", 24, 29, NormAttr
ELSE
    QPrintRC " [ LOADING DATA FILES " + STR$(NumFilesLoaded) + " ] ", 24, 29, NormAttr
END IF

IF UngS <> "GERMAN" THEN 
REDIH RenuS(0 TO 0) 
RESTORE HanuData 
a a 0 



DO 



LOOP 



READ xS 

IF xS a -END" THEN EXIT DO 
RED IN PRESERVE KenuS(0 TO a) 
IF NOT TeraTypeFLag THEN 

a = INSTRUS, "|") 

IF a THEN NIDSCxS, a, 1> a «|» 

IF LEFTS (xS, 1) = THEN xS a STRINGS<24, "-") 

END IF 

RenuSCoO - xS 

a » a ♦ 1 - 



RE DIM GersannenuSCO TO 0) 
RESTORE HenuGeraenOata 
a a Q 
DO 

READ xS 

IF xS « "END" THEN EXIT 00 

RED IN PRESERVE GeraenftetuSCO TO o) 

If J»T TcraTypef lag THEN 

B a lMSTR(xSr B | l, > " 

IF a THEN WDSCxS. a, 1) a 
^ ^ IP LEFTS (xS, 1) » "-" THEN xS « 3TRINGSC24, "-") 

GeraanfteRuSCa) a xS 
m « a ♦ 1 

LOOP 



'create MenuChoice$ (a list of valid keys to pick from menu)
'extract the characters from the menu display itself since the keys
'are always listed as the last character on the line

MenuChoice$ = SPACE$(m - 1)
IF Lang$ = "GERMAN" THEN
    FOR i = 1 TO m - 1
        x$ = RIGHT$(GermanMenu$(i), 1)
        'make the "-" choices into chr$(0) so they can't be picked
        IF x$ = "-" THEN x$ = CHR$(0)
        MID$(MenuChoice$, i, 1) = x$
    NEXT
ELSE
    FOR i = 1 TO m - 1
        x$ = RIGHT$(Menu$(i), 1)
        'make the "-" choices into chr$(0) so they can't be picked
        IF x$ = "-" THEN x$ = CHR$(0)
        MID$(MenuChoice$, i, 1) = x$
    NEXT
END IF

IF Lang$ <> "GERMAN" THEN
    REDIM NLG$(1 TO 8), LENNLG(1 TO 8)
    RESTORE NLGData
    FOR i = 1 TO 8
        READ NLG$(i)
        LENNLG(i) = LEN(NLG$(i))
    NEXT

    REDIM Section$(1 TO 3)
    RESTORE SectionData
    FOR i = 1 TO 3
        READ Section$(i)
    NEXT

    REDIM Article$(1 TO 3)
    RESTORE ArticleData
    FOR i = 1 TO 3
        READ Article$(i)
    NEXT
ELSE
    REDIM Paragraph$(1 TO 3)
    RESTORE ParagraphData
    FOR i = 1 TO 3
        READ Paragraph$(i)
    NEXT

    REDIM Artikel$(1 TO 3)
    RESTORE ArtikelData
    FOR i = 1 TO 3
        READ Artikel$(i)
    NEXT
END IF

REDIM Numbers$(1 TO 30)
RESTORE NumData
FOR i = 1 TO 30
    READ Numbers$(i)
NEXT

CALL LoadPrefixes(Prefixes$(), MeanPrefixes$(), Lang$)
END SUB 

SUB CPMnt US) STATIC 
END SUB 

SUB CreateTables STATIC

'-- load in the 3-letter index tables from disk into EMS
Symb = 28
First = 1: Last = 2

REDIM XLateTable%(0 TO 122)
NumEl = Symb ^ 3

XLateTable%(47) = 1 ' / char, as used in non-wildcard words
XLateTable%(38) = 2 ' & char, as used in S&P, A&P, etc.
FOR i = ASC("a") TO ASC("z")
    XLateTable%(i) = i - 94 ' so that a=3, b=4 ... z=28
NEXT

REDIM SingTableFir%(1 TO NumEl)
FGetAH LstDir$ + "KEYWORD.FIR", SEG SingTableFir%(1), 2, NumEl
Array2EMS SingTableFir%(1), 2, NumEl, SingTblFirEMS
ERASE SingTableFir%

REDIM SingTableLas%(1 TO NumEl)
FGetAH LstDir$ + "KEYWORD.LAS", SEG SingTableLas%(1), 2, NumEl
Array2EMS SingTableLas%(1), 2, NumEl, SingTblLasEMS
ERASE SingTableLas%

REDIM CombTableFir%(1 TO NumEl)
FGetAH LstDir$ + "KEYCOMB.FIR", SEG CombTableFir%(1), 2, NumEl
Array2EMS CombTableFir%(1), 2, NumEl, CombTblFirEMS
ERASE CombTableFir%

REDIM CombTableLas%(1 TO NumEl)
FGetAH LstDir$ + "KEYCOMB.LAS", SEG CombTableLas%(1), 2, NumEl
Array2EMS CombTableLas%(1), 2, NumEl, CombTblLasEMS
ERASE CombTableLas%

END SUB

SUB CutWord (KWord$)

'-- cut KWord$ by "--" or by " " or just truncates more than 24 letters

IF LEN(KWord$) > 24 THEN

KtterdS a LEFTS OOJordS, 21) ♦ 
1 — we'd prefer to cut off the word at a 

END IF 
END SUB 

SUB Define (Word$) STATIC

REDIM Scr%(800) ' 3 rows of 80 columns, including window
MScrnSave 21, 1, 25, 80, Scr%(0)
ClearScr0 21, 1, 25, 80, RevAttr
CALL DrawBox(21, 1, 25, 80, 1, RevAttr)

IF L«ng$ = »6ERHAN» THEN 

QPrintBC -Definition vcn ♦ Worts ♦ •••• 23 4 -1 
ELSE " ' * * 

^ QPMntRC "Definition of *■ ♦ words + 23, 4, -1 

wait Space 

NScmRest 21, 1, 25, 80, SEG ScrXCD 
ERASE ScrX 

END SUB 

FUNCTION Diets (Code?) STATIC 

'= This function translates a code* to* the actual word/phrase 

" ^ 11™™^"*"** ™' U ' 5 a SU4PS ~l«ive 

Flags = CHRSt250) 

OSE 

Flags s " - 

END IF 

iTTT ^I " hMd " ,or tMs coda 
DM Diet leap AS DIctTyae 

CALL EasGet1El(SEG DictTeap, LEW (Diet Temp), CodeX, OictUordHX) 
IF Code > 0 THEN 

' return tr>e frequency of the word prefixed to the word itself 

•OietS • Spece*vaJ(Code, 4) ♦ " • ♦ SpaceNuaSOcrr«rfWr«w.> «. ^ 

^e»iuasocrindx<Code).Nuii, 4} ♦ f Lags - RTRMSCDictTeap.Str) 

. .f" 8 ^ ^W^- ^1>LEN, CLNGCCode), PolySemyEHS 

Diets » SpaceNuaSC^CCodO.Nua, 3) ♦ ? • /iSScoietTeap.Str) 

•* use the following i-ne instead to display poly walue 

0,ctS = UraaTOIMtS^SCfelyVaUie.VaSe)) S> ♦ ■ « ♦ W^rrvr^ , 

. SpaeeNumSClcyindxCCCde).!*™, 5 ) ♦ Flags ♦ RTRI»S<0ictTeBp.3tr> 

ELSE ' this word wasn't found? 
Diets = ■ ftla-O " 

END IP 

END FUNCTION 

FUNCTION DictSrchX (Were AS DictType) STATIC 

\Z ^"L!"^" *™ ^\ ar ; ay DictType (word ft code*) for a 
•= oTetcS c^sha^** ™« 

L^?cr:^D s ict?e2r' 01ctTttp2 m 0<ctTy * 

L = 1: R = DictCodeNum ' total number of code entries
DO
    x = (CLNG(L) + R) \ 2

    EmsGet1El DictTemp, LENDict, x, DictCodeH%

    IF LCASE$(Word.Str) < LCASE$(DictTemp.Str) THEN
        R = x - 1

ELSE 



IF tCASSS(Uond ; Str) <> LCASESCOictTcep.Str) OR LangS «> "EfffiLlSH* T,£N 



ELSE 



r f!Zl\ Xmp S^ i ^ «se we need t0 restore it 
if the forward/backward look doesn't bring 
P°^ e "^ta (i.e., we didn't get a retch and we 
need the original DietTemp for the LOOP test) ' 

SWAP DlctTerpZ, DictTeap 

EasSetlEl DIctTenp, LENDict. x ♦ 1, DictCodeHX 

IF word.Str = DlctTeop.Str THEN ■ 

X » X ♦ 1 

EXIT DO 

ELSE 

CasGetlEl OictTeap. LENDict, x - « DictCodeHX 
IF Word.Str = OictTeep.Str THEN 
X a X ♦ 1 

EXIT DO 



END IF 
END IF 

SUAP DictTeap. 0ictTeas2 

I 8 t ♦ 1 ^ 

END IF 

END IF 

LOOP UNTIL Word.Str = DictTemp.Str OR L > R

IF Word.Str = DictTemp.Str THEN ' it was found, so return where
    DictSrch% = DictTemp.Code
    Word.Code = DictTemp.Code
ELSE ' wasn't found, return 0
    DictSrch% = 0
    Word.Code = 0
END IF

END FUNCTION 

SUB DispMsg (Msg$, R%, C%) STATIC

STATIC WindOpen, Scr%() ' is there already a message displayed?
LOCATE , , 0 ' turn the cursor off

ir^r™^ " ~ if u « " eto » ^ 

IF HsgS = THEN 

» Efa^"" 11 ' *° tlMe - ind ~ - »<• ** 

EXIT SL3 

ELSE 

^cXST'* r " u * uind ~ *• <*«• » - 

GOSUB ttig Close 

END IF 

END IF 

Wid = LEN(Msg$)

IF Wid > 60 THEN Wid = 60

Msg$ = Msg$ + " " ' make sure there's a space to find at the end (see below)

MaxLin = LEN(Msg$) \ Wid + 3
IF MaxLin > 23 THEN MaxLin = 23
REDIM Text$(MaxLin)
Lin = 0



*T ,B f l?ne counter for number of lines of text 
Lin — Lin ♦ i 

*— look for the Last space so we can word wrap at that point 
LastSpe a oinstrSUUid ♦ 1, fegS, " ») ^ 
TextSCLin) = RT3IKSC LEFTS (hsgS, LastSpe)) 

LOO? WHILE LEN(ttsgS) > Uid 

Msg$ = RTRIM$(Msg$)
IF LEN(Msg$) THEN
    Lin = Lin + 1
    Text$(Lin) = Msg$
END IF

'-- recompute width based on word-wrapped text
Wid = 0
FOR i = 1 TO Lin
    IF LEN(Text$(i)) > Wid THEN Wid = LEN(Text$(i))
NEXT

IF R <> 0 AND C » 0 THEN 

UU = R 



ELSE 

ULr =» 11 

END IF 
DULr = ULr --1 
LRr =- ULr ♦ Lin ♦ 1 
DLRr » LRr ■* 2 

heriznrgin = CfiO - uid) \ 2 . 
ULc = heriaeargin » 
OULc = ULC - 3 
LRc = 80 - INTlhorizaargin) 
IF Uid / 2 =» Uid \ 2 TK2* LRC = LBc + 1 
OUte » LRc ♦ 1 

RED IN ScrZ(ArraySl2eZ(DULr. DULc. DLRr, OLRO) 
ScrnSaveO DUU, DUU. DLRr. OLRc, SES ScrX(O) 
UlndRgr ULr, ULc, LRr, LRc, 4, NoroAttr. NoraAttr, 
FOR i » 1 TO Lin 

QFrlntRC TextSCD, ULr ♦ 1, ULc ♦ 1, -1 

NEXT 

r a ULr ♦ Lin 

C • UU ♦ 1 ♦ LEMTextSiLin)) • 

If LEN(TextS(Lin)> ♦ 2 • Wid THEN c • ULc ♦ 1: R • R ♦ 1 

ERASE Text$
WindOpen = TRUE

EXIT SUB

' close window
MsgClose:

ScrnRest0 DULr, DULc, DLRr, DLRc, SEG Scr%(0)
ERASE Scr%
WindOpen = FALSE
RETURN

END SUB 

SUB EmsAlloc (NumPages%, Handle%, LoadFile$) STATIC

CALL EmsAllocMem(NumPages%, Handle%)
IF EmsError% THEN
    PRINT "Couldn't allocate"; CLNG(NumPages) * SixteenK / 1024; "Kbytes of EMS for "; LoadFile$
    CALL FreeHandles
    CALL Chime(2)
    END
END IF

END SUB

SUB FindCombKey (WordList$(), NumWord%, CombFound$(), NumCombFound%) STATIC
'ARRAY NAME      LEN   DESCRIPTION               DIRECTION    MODIFIED?
'CombKeyFILE     64    Combined Keyword List     (Shared)     (Unchanged)
'CombFound$()    VAR   Combined Keywords Found   (Returned)   (Changed)
'WordList$()     VAR   Document Words            (Passed)     (Unchanged)

DIM KeyTemp AS CombKeyType ' for interface to EmsGet1El
LENKeyTemp = LEN(KeyTemp)

FOR i = 1 TO NumWord - 1 ' number of words to process
    AtFlag = FALSE
    NotFlag = FALSE

    '-- make lower case since combined keywords ignore case
    Word$ = LCASE$(WordList$(i))

    IF Word$ = "" GOTO SkipCombKey

    ' if it's a valid range, then do comparisons for words in the range
    IF FirstLast%(Word$, First, Last, Comb) THEN

        FOR j = Last TO First STEP -1

            FGetRT CombKeyFILE, KeyTemp, CLNG(j), LENKeyTemp
            Words = InCount(RTRIM$(KeyTemp.Str), " ") + 1 ' count number of words

1 <f , the keyword has acre words than are left in the wore list 

!_ * klp there's no possibility of a Batch. 



i ♦ 1 

IF LangS = "GERMAN" THEN 1 At Flog should work onlv for Geraan 
IF iHSTRCKeyTeap.Str, -a*) « 0 GOTO SklpConoKey 

ELSE * 

GOTO SkipCoobKey 

END IF 

END IF 

            CALL Extract(KeyTemp.Str, " ", 1, Strt, Slen) ' extract first word
            CurrKey$ = MID$(KeyTemp.Str, Strt, Slen)      ' of combined keyword

            IF RIGHT$(CurrKey$, 1) = "/" THEN
                Exact = TRUE
                CurrKey$ = LEFT$(CurrKey$, Slen - 1)
                Slen = Slen - 1
            ELSE
                Exact = FALSE
            END IF

            ' compare first word of combined key [CurrKey$]
            ' against the current document word [Word$]

            IF RIGHT$(Word$, 1) = "/" THEN Word$ = LEFT$(Word$, LEN(Word$) - 1)

            IF Exact THEN ' check for "exact" match
                Match = (LCASE$(CurrKey$) = LCASE$(Word$))
            ELSE
                Match = (LCASE$(CurrKey$) = LCASE$(LEFT$(Word$, Slen)))
            END IF

            ' no match, skip to next combined key in the First-Last range
            IF NOT Match GOTO SkipCombKey

            ' continue matching the rest of the words in the combined key
            ' exiting out as soon as there's a non-match

            FOR k = 1 TO Words - 1 ' number of words left in combined key

                ' extract the next word from the current combined keyword (j)
                CALL Extract(KeyTemp.Str, " ", k + 1, Strt, Slen)
                CurrKey$ = MID$(KeyTemp.Str, Strt, Slen)

                IF RIGHT$(CurrKey$, 1) = "/" THEN
                    Exact = TRUE
                    CurrKey$ = LEFT$(CurrKey$, Slen - 1)
                    Slen = Slen - 1
                ELSE
                    Exact = FALSE
                END IF
■ — Find ABCGenan only) ASS, 8 ASnotG in Ooe if A£c is in Diet 
                IF AtFlag = FALSE AND NotFlag = FALSE AND i + k <= NumWord THEN
                    DocWord$ = WordList$(i + k)
                ELSE
                    IF AtFlag = FALSE AND NotFlag = TRUE AND i + k < NumWord THEN
                        DocWord$ = WordList$(i + k + 1)
                    ELSE
                        IF AtFlag = TRUE AND NotFlag = FALSE AND k > 1 THEN
                            DocWord$ = WordList$(i + k - 1)
                        ELSE
                            DocWord$ = WordList$(i + k)
                        END IF
                    END IF
                END IF

                IF RIGHT$(DocWord$, 1) = "/" THEN DocWord$ = LEFT$(DocWord$, LEN(DocWord$) - 1)

                IF CurrKey$ = "@" THEN ' special processing for @ wildcard
                    IF INSTR(AtList$, "/" + DocWord$ + "/") THEN
                        Match = TRUE ' the word was in the @ list, so continue
                    ELSE
                        IF Lang$ = "ENGLISH" THEN
                            Match = FALSE
                        ELSE
                            Match = TRUE
                            AtFlag = TRUE
                        END IF
                    END IF
                ELSE
                    IF Exact THEN ' check for "exact" match
                        Match = (LCASE$(CurrKey$) = LCASE$(DocWord$))
                    ELSE ' wildcard match, only compare # of chars in CurrKey$
                        Match = (LCASE$(CurrKey$) = LCASE$(LEFT$(DocWord$, Slen)))
                    END IF
                END IF

                IF Match AND i + k < NumWord THEN
                    DocWord$ = WordList$(i + k + 1)
                    IF DocWord$ = "not" OR DocWord$ = "be" OR DocWord$ = "nicht" THEN NotFlag = TRUE
                END IF

                IF NOT Match THEN EXIT FOR
            NEXT ' word in current combined keyword

            IF Match THEN ' this is a combined keyword, so add it to the list
                NumCombFound = NumCombFound + 1
                REDIM PRESERVE CombFound$(1 TO NumCombFound)
                CombFound$(NumCombFound) = RTRIM$(KeyTemp.Str)

                '-- mark combined word so that single keys
                '   are not generated from parts of combined keys found
                IF NotFlag AND NOT AtFlag THEN 'if A B not B as A @ B
                    FOR k = i TO i + Words
                        WordList$(k) = WordList$(k) + CHR$(255)
                    NEXT
                ELSE
                    IF NOT NotFlag AND AtFlag THEN 'if AB as A @ B
                        FOR k = i TO i + Words - 2
                            WordList$(k) = WordList$(k) + CHR$(255)
                        NEXT
                    ELSE 'normal procedure
                        FOR k = i TO i + Words - 1
                            WordList$(k) = WordList$(k) + CHR$(255)
                        NEXT
                    END IF
                END IF
                EXIT FOR
            END IF

SkipCombKey:

        NEXT

    END IF ' table range was valid
NEXT ' key in list
END SUB 

SUB FindSingKey (WordList$(), NumWord%, SingFound$(), NumSingFound%) STATIC

DIM KeyTemp AS SingKeyType ' for interface to EmsGet1El
DIM ExclTemp AS DictType
LENKeyTemp = LEN(KeyTemp)
NumSingFound = 0
ExcludeAdd$ = ""
FOR i = 1 TO NumWord ' number of words in document

    MeanPrefixFlag = FALSE: PrefixFlag = FALSE: CombineFlag = FALSE
    Word$ = WordList$(i)
    IF Word$ > "" THEN

        IF RIGHT$(Word$, 1) = CHR$(255) THEN
            Word$ = LEFT$(Word$, LEN(Word$) - 1)
            CombineFlag = TRUE
        END IF

        IF Lang$ <> "GERMAN" THEN

            SELECT CASE ASC(Word$)
                CASE AscA TO AscZ: SearchStep = 1
                CASE AscUpperA TO AscUpperZ:
                    IF ASC(MID$(Word$, 2, 1)) < AscUpperA OR ASC(MID$(Word$, 2, 1)) > AscUpperZ THEN
                        SearchStep = 2
                    END IF
                CASE ELSE: SearchStep = 3
            END SELECT

        END IF

        ' check if the first 3 letters of the word return
        ' a valid range from the 3-dimensional table array

TryAgain:

        Slen = LEN(CurrKey$)

        IF RIGHT$(CurrKey$, 1) = "/" THEN
            Exact = TRUE
            CurrKey$ = LEFT$(CurrKey$, Slen - 1)
            Slen = Slen - 1
        ELSE
            Exact = FALSE
        END IF

        'compare the single keyword [CurrKey$/KeyTemp.Str]
        'against the document word [Word$]

        IF Exact THEN ' check for "exact" match
            Match = (CurrKey$ = Word$)
        ELSE ' check for wildcard match
            Match = (CurrKey$ = LEFT$(Word$, Slen))
        END IF

        IF Match THEN

            IF NOT CombineFlag THEN 'for both languages
                ' add the single keyword to the list
                NumSingFound = NumSingFound + 1
                REDIM PRESERVE SingFound$(1 TO NumSingFound)
                SingFound$(NumSingFound) = RTRIM$(KeyTemp.Str)
                EXIT FOR



            ELSE
                ' it's part of a combined key, so don't store it but add to the exclude list
                ExclTemp.Str = RTRIM$(KeyTemp.Str)
                Code = DictSrch%(ExclTemp)
                ExcludeAdd$ = ExcludeAdd$ + MKI$(Code)
                EXIT FOR

            END IF

        END IF
        NEXT ' key in range

        IF NOT Match AND Lang$ <> "GERMAN" THEN
            SELECT CASE SearchStep

                CASE 1: Word$ = UCASE$(LEFT$(Word$, 1)) + LCASE$(RIGHT$(Word$, LEN(Word$) - 1))
                    SearchStep = SearchStep + 1
                    GOTO TryAgain
                CASE 2: Word$ = UCASE$(Word$)
                    SearchStep = SearchStep + 1
                    GOTO TryAgain
                CASE ELSE:
                    IF i = 1 THEN
                        FirstLetter = ASC(LEFT$(LTRIM$(Word$), 1))
                        IF FirstLetter >= 65 AND FirstLetter <= 90 THEN
                            Word$ = LCASE$(Word$)
                            GOTO TryAgain
                        END IF
                    END IF

            END SELECT
        END IF 'not match
ELSE 1 range was not valid 

LETTERS a LEFTSCUordS, 2) 

IP LETTERS a "2S" OR LETTERS a »»- THEM 

            'add the single keyword to the list
            NumSingFound = NumSingFound + 1
            REDIM PRESERVE SingFound$(1 TO NumSingFound)
            SingFound$(NumSingFound) = RTRIM$(Word$)

        END IF

        END IF ' the range was valid

check for prefixes 

IF NOT HeanPref ixFlag THEN 

"^.^TEgwuK»"- » <™« — t. t» par:, 

LenU a LEN<VordS) 
FCR NuaLat = UT0 3 STEP -1 

IF LenU >= fcnUt ♦ 3 THEN 'should leave at least 3 letters 

IF INSTRCMeanPrefixesS<NuaLet> # -\- ♦ LEFTS (UardS. NuaLet) ♦ -\«) then 
UordTeapS . WW<UordS, Nustet ♦ i; ' ' THEN 

Words « LEFTS (Words, Nu*Let> 
HeanPrefixFLag ■ TRUE 
EXIT FOR 

END IF 

END IF 

NEXT 

IF BeanPref IxFlag THEN 

tFJatch THEN CoabineFlag » TRUE 
END IF Tr yAgain 'check agatn 
ELSE 'if ReanPref ixFlag 

IF yordTe^« --THEN 'CoabineFLog still TRUE 

wbrdS a uordTtepS 

WordTedpS a 

GOTO TryAgain 

END IF 
END IF •HeanPrefiaFLag 

£Lff , * n9less prtf *"* °>^te it 
        IF NOT PrefixFlag THEN 'only one time
            Word$ = LCASE$(Word$)
            LenW = LEN(Word$)

            FOR NumLet = 9 TO 2 STEP -1
                IF LenW >= NumLet + 3 THEN 'should leave at least 3 letters
                    IF INSTR(Prefixes$(NumLet), "\" + LEFT$(Word$, NumLet) + "\") THEN
                        Word$ = MID$(Word$, NumLet + 1)
                        PrefixFlag = TRUE
                        EXIT FOR
                    END IF
                END IF
            NEXT

            IF PrefixFlag THEN
                IF Match THEN CombineFlag = TRUE
                GOTO TryAgain 'check again
            END IF
        END IF 'PrefixFlag
    END IF 'Word$ > ""

NEXT ' word in document

END SUB

FUNCTION FirstLast% (Wordx$, First%, Last%, SingOrComb%) STATIC

' returns the starting (First) and ending (Last) range for the word
' by looking it up in the Table%() array

Word$ = RTRIM$(Wordx$)

IF LEN(Word$) < 3 THEN Word$ = Word$ + STRING$(3 - LEN(Word$), "/")

SpaceLoc = INSTR(Word$, " ")

IF SpaceLoc > 1 AND SpaceLoc < 4 THEN
    Word$ = LEFT$(Word$, SpaceLoc - 1) + STRING$(4 - SpaceLoc, "/")
END IF

a = XLateTable%(ASCII%(Word$))
b = XLateTable%(MidChar%(Word$, 2))
c = XLateTable%(MidChar%(Word$, 3))

IF a = 0 OR b = 0 OR c = 0 THEN FirstLast% = 0: EXIT FUNCTION
Index& = (a - 1) * 784 + (b - 1) * 28 + c

IF SingOrComb% = Sing THEN
    TableFirEMS = SingTblFirEMS
    TableLasEMS = SingTblLasEMS
ELSE
    TableFirEMS = CombTblFirEMS
    TableLasEMS = CombTblLasEMS
END IF

EmsGet First%, 2, Index&, TableFirEMS
EmsGet Last%, 2, Index&, TableLasEMS

'Return FALSE if there was no valid range (i.e., First%=0)
FirstLast% = (First% <> 0)

END FUNCTION
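
' Worked example (an illustrative sketch, not part of the original listing).
' It assumes the character codes set up in CreateTables ("/" = 1, "&" = 2,
' "a" = 3 ... "z" = 28) and treats the three characters as digits in
' base 28 (784 = 28 * 28). The words used here are hypothetical.
'
'   "cat":  first = 5 ("c"), second = 3 ("a"), third = 22 ("t")
'           index = (5 - 1) * 784 + (3 - 1) * 28 + 22 = 3136 + 56 + 22 = 3214
'   "a":    padded to "a//", i.e. 3, 1, 1
'           index = (3 - 1) * 784 + (1 - 1) * 28 + 1 = 1569
'   "zzz":  index = 27 * 784 + 27 * 28 + 28 = 21952 = 28 ^ 3
'
' The last value is the largest possible index, so each First/Last table
' needs 28 ^ 3 two-byte entries.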



SUB FreeHandles STATIC

' Release EMS

IF DictCodeH% THEN EmsRelMem DictCodeH%
IF DictWordH% THEN EmsRelMem DictWordH%
IF SingTblFirEMS THEN EmsRelMem SingTblFirEMS
IF SingTblLasEMS THEN EmsRelMem SingTblLasEMS
IF CombTblFirEMS THEN EmsRelMem CombTblFirEMS
IF CombTblLasEMS THEN EmsRelMem CombTblLasEMS
IF Array1EMS THEN EmsRelMem Array1EMS
IF PolySemyEMS THEN EmsRelMem PolySemyEMS

' Close Files

IF KeyNdxFile THEN FClose KeyNdxFile
IF KYInvertDatFILE THEN FClose KYInvertDatFILE
IF FCInvertDatFILE THEN FClose FCInvertDatFILE
IF HeaderFILE THEN FClose HeaderFILE
IF SingKeyFILE THEN FClose SingKeyFILE
IF CombKeyFILE THEN FClose CombKeyFILE

'-- close com port if opened
CPrint "CLOSE"
END SUB

SU8 HistRessage STATIC 

•UlndNgr 9, 66, 10, 79, 4, RevAttr, RevAttr, - 

^ r ] nt fE ! CHTK 65, RevAttr 

QPMntRC • for HIGHLIGHTS \ 10, 65, RevAttr 

END sua 

SUB InsertWord (Expr AS ExpressionType) STATIC

'-- insert the last word deleted from a LIFO stack
'   and then do a ShowExpr

'-- get the number of words in the deleted word stack (LIFO$)
x = LEN(LIFO$)

IF x THEN

    Code = CVI(RIGHT$(LIFO$, 2))
    Expr.SubExpr(1).Num = Expr.SubExpr(1).Num + 1
    NumExprWords = NumExprWords + 1

    SortT ExprCodes(1), NumExprWords, DESCEND, LEN(ExprCodes(1)), 2, -3
    Expr.Num = 1
    Expr.SubExpr(1).Num = NumExprWords
    FOR k = 1 TO NumExprWords
        Code2Str Expr.SubExpr(1).Phrase, k, ExprCodes(k).Code
    NEXT

    IF x = 2 THEN
        LIFO$ = ""
    ELSE
        LIFO$ = LEFT$(LIFO$, x - 2)
    END IF

ELSE

    ' no words to undelete (insert)
    Chime 1

END IF
END SUB

FUNCTION KeyInstr% (KeyStr$, Srch$) STATIC

' returns TRUE if Srch$ is in KeyStr$ at an odd offset
' otherwise FALSE (i.e. it wasn't found, or was found at an even offset)

k = INSTR(KeyStr$, Srch$)

' continue searching if it was found at an even offset
' but stop if it's 0 (not found anymore) or odd (found correctly)

DO WHILE k / 2 = k \ 2 AND k <> 0 ' even
    k = INSTR(k + 1, KeyStr$, Srch$)
LOOP

IF k = 0 THEN ' it wasn't found in the right place
    KeyInstr% = FALSE
ELSE

    'NOTE: Since this routine does not return a TRUE value, but simply a
    '      non-zero value, the NOT operator can't be used with this function.
    '      When using KeyInstr% in an IF..THEN statement, it must test for
    '      KeyInstr% = FALSE.

    KeyInstr% = (k + 1) \ 2 ' return actual location

END IF

END FUNCTION

FUNCTION KeyMid$ (KeyStr$, Start) STATIC
    KeyMid$ = MID$(KeyStr$, Start * 2 - 1, 2)
END FUNCTION
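
' Usage sketch (illustrative only, not part of the original listing).
' Keys$, Code$ and Slot are hypothetical names, and it is an assumption
' that key strings hold packed 2-byte integers built with MKI$, as the
' 2-byte stride in KeyInstr% and KeyMid$ suggests.
'
'   Keys$ = MKI$(7) + MKI$(12) + MKI$(40)   ' three 2-byte codes
'   Code$ = MKI$(12)
'   Slot = KeyInstr%(Keys$, Code$)          ' 12 starts at byte 3, so Slot = 2
'   IF Slot = FALSE THEN                    ' compare with FALSE: NOT Slot
'       PRINT "code not present"            ' would be -3, which is still "true"
'   ELSE
'       PRINT CVI(KeyMid$(Keys$, Slot))     ' reads the code back: 12
'   END IF
'
' The odd-offset test matters because a match starting at an even byte
' would straddle two neighbouring codes rather than being one stored code.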
SUB LoadData STATIC

' DICTIONARY

' Load in Word Sorted Dictionary directly into EMS (translate Code% to Word$)

EmsPg = EmsGetPFSeg%

File$ = LstDir$ + "DICT.COD"

GOSUB UpdateStatusLine

DIM DictCodeTemp AS DictType

IF NOT Exist%(File$) THEN CLS : PRINT File$; " not found.": END

SizeofFile& = FileSize&(File$)
DictCodeNum = SizeofFile& \ LEN(DictCodeTemp)

' "Loading " + Num$(DictCodeNum) + " Dictionary Synonym Entries: ", r, c

NumPages = SizeofFile& \ SixteenK + 2 ' round off to nearest 2 pages
EmsAlloc NumPages, DictCodeH%, File$
IF EmsError% GOTO EMSErrHandler

Num32kBlocks = SizeofFile& \ ThirtyTwoK&
LeftOver& = SizeofFile& - (Num32kBlocks * ThirtyTwoK&)

FOpenAll File$, 0, 4, DictCodeFILE
FOR i = 1 TO Num32kBlocks + 1

    QPrintRC STR$(i), R, c - 3, -1

    '-- map pages of the DictCodeH% memory to the EMS upper mem page frame
    FOR j = 1 TO 2
        EmsMapMem DictCodeH%, j, (i - 1) * 2 + j
        IF EmsError% GOTO EMSErrHandler
    NEXT

    '-- seek to beginning of current block
    FSeek DictCodeFILE, (i - 1) * ThirtyTwoK&
    IF DosError% GOTO DOSErrHandler

    IF i < Num32kBlocks + 1 THEN
        '-- get the 32k block and put it directly into the EMS page frame
        FGetA DictCodeFILE, BYVAL EmsPg, BYVAL 0, ThirtyTwoK&
        IF DosError% GOTO DOSErrHandler
    ELSE
        '-- load the left over (<32k) bytes
        FGetA DictCodeFILE, BYVAL EmsPg, BYVAL 0, LeftOver&
        IF DosError% GOTO DOSErrHandler
    END IF

NEXT

FClose DictCodeFILE



' contains pointers to DICT.COD in EMS for use in FullText

REDIM Array1(1 TO DictCodeNum) AS Array1Type
DictLEN = LEN(DictCodeTemp)

GOSUB UpdateStatusLine

FOR i = 1 TO DictCodeNum
    Array1(i).RecNum = i
    EmsGet1El DictCodeTemp, DictLEN, i, DictCodeH%
    Array1(i).CodeNum = DictCodeTemp.Code
NEXT

SortT Array1(1), DictCodeNum, ASCEND, LEN(Array1(1)), 2, -1
Array2EMS Array1(1), LEN(Array1(1)), DictCodeNum, Array1EMS
ERASE Array1

' Load in Code Sorted Dictionary directly into EMS (translate Code to words)

File$ = LstDir$ + "DICT.WRD"

DIM DictWordTemp AS DictType

SizeofFile& = FileSize&(File$)
DictWordNum = SizeofFile& \ LEN(DictWordTemp)
SecCode = DictWordNum - 3030
ArtCode = DictWordNum - 30

GOSUB UpdateStatusLine

NumPages = SizeofFile& \ SixteenK + 2 ' round off to nearest 2 pages
EmsAlloc NumPages, DictWordH%, File$
IF EmsError% GOTO EMSErrHandler

Num32kBlocks = SizeofFile& \ ThirtyTwoK&
LeftOver& = SizeofFile& - (Num32kBlocks * ThirtyTwoK&)

FOpenAll File$, 0, 4, DictWordFILE
FOR i = 1 TO Num32kBlocks + 1

    QPrintRC STR$(i), R, c - 3, -1

    '-- map pages of the DictWordH% memory to the EMS upper mem page frame
    FOR j = 1 TO 2
        EmsMapMem DictWordH%, j, (i - 1) * 2 + j
        IF EmsError% GOTO EMSErrHandler
    NEXT

    '-- seek to beginning of current block
    FSeek DictWordFILE, (i - 1) * ThirtyTwoK&
    IF DosError% GOTO DOSErrHandler

    IF i < Num32kBlocks + 1 THEN
        '-- get the 32k block and put it directly into the EMS page frame
        FGetA DictWordFILE, BYVAL EmsPg, BYVAL 0, ThirtyTwoK&
    ELSE
        '-- load the left over (<32k) bytes
        FGetA DictWordFILE, BYVAL EmsPg, BYVAL 0, LeftOver&
    END IF

    IF DosError% GOTO DOSErrHandler

NEXT

FClose DictWordFILE

» INVERTED RELATIVES 

• load/open the Relative Inverted Files for the SWAPS routine 

Filet a NdxDirS ♦ -REl-lNVS.NDX" 
IF NOT Exi*tX< FileS) THEN 
CLS 

PRINT FileS; " was not found." 

CALL Freehand l-s 

END 

END IF 
NdxLEN » 6 

FreqCoapNua = FlleSlze&CFHeS) \ NdxLEN 

RED IN FCIndxH TO OictVordNua) AS SaallNdxType 

GOSUB UpdateStatusline 

FGetAN FileS, FCIndxCI), NdxLEN, FreqCoepNua 
File$ = NdxDir$ + "RELATIVES"
FOpenAU File$ # 0, 4, Fr«j63FILE 



— ' I M V E R T E 

* Load the Inverted Data Piles 

filet » NdxDirS ♦ "KYINVRTs.NDX" 
XP NOT EUistX(FileS) THEN 
CL3 

PRINT fileS; " was not found." 

CALL Freehandles 

END 

END IF 

NuaKeys « FHeSfeeftCFUcS) \ NdxLEN 

REDIN KYIndx<1 TO DictVordNuia) AS SoellNdxType 

GOSUB UpdateStatusLine 

FGetAH FileS, SE5 KYIndxCI), NdxLEN, NusKeys 

Files a NdxDirS ♦> "mNVERT.DAT" 
IF NOT ExistZ(FileS) THEN 
OS 

PRINT Files; - was not found. " 

CALL Freehandles 

ENS 

END IF 

G0SU8 UpdateStatusLine 

FOpenAU Files, 0, 4, KYInvertDatFlLE 

' Open the NEWKEY.NDX file, sorted by weight*poly^0.25 for use when
' showing keywords in current document

File$ = NdxDir$ + "NEWKEY.NDX"
NumRecords& = FileSize&(File$) \ 256
IF NOT Exist%(File$) THEN
    CLS
    PRINT File$; " was not found."
    CALL FreeHandles
    END
END IF

GOSUB UpdateStatusLine
FOpenAll File$, 0, 4, KeyNdxFile



' BITCOUNT

DIM BitCount%(0 TO 255)
FOR i = 0 TO 255
    x$ = CHR$(i)
    x = 0
    FOR j = 0 TO 7
        x = x - GetBit%(x$, j)
    NEXT
    BitCount%(i) = x
NEXT

' SINGKEY / COMBKEY

File$ = LstDir$ + "SINGKEY.STR"
NumSingKey = FileSize&(File$) \ 32

GOSUB UpdateStatusLine

FOpenAll File$, 0, 4, SingKeyFILE
IF DosError% GOTO DOSErrHandler

File$ = LstDir$ + "COMBKEY.STR"
NumCombKey = FileSize&(File$) \ 64
FOpenAll File$, 0, 4, CombKeyFILE

GOSUB UpdateStatusLine

IF DosError% GOTO DOSErrHandler

' POLYSEMY

NumPages = CLNG(FreqCompNum) * PolyLEN \ SixteenK + 2
EmsAlloc NumPages, PolySemyEMS, "Polysemy Storage"
File$ = NdxDir$ + "POLYSEMY.DAT"
GOSUB UpdateStatusLine

FOpenAll File$, 0, 4, PolysemyFILE

'-- to fit in smaller memory load only as many poly values as there are
'FOR i = 1 TO FreqCompNum
'    FGetRT PolysemyFILE, PolyValue, CLNG(i), PolyLEN
'    EmsSet PolyValue, PolyLEN, CLNG(i), PolySemyEMS
'NEXT

SizeofFile& = FileSize&(File$)
Num32kBlocks = SizeofFile& \ ThirtyTwoK&
LeftOver& = SizeofFile& - (Num32kBlocks * ThirtyTwoK&)

FOR i = 1 TO Num32kBlocks + 1

    QPrintRC STR$(i), R, c - 3, -1

    '-- map pages of the PolySemyEMS memory to the EMS upper mem page frame
    FOR j = 1 TO 2
        EmsMapMem PolySemyEMS, j, (i - 1) * 2 + j
        IF EmsError% GOTO EMSErrHandler
    NEXT

    '-- seek to beginning of current block
    FSeek PolysemyFILE, (i - 1) * ThirtyTwoK&
    IF DosError% GOTO DOSErrHandler

    IF i < Num32kBlocks + 1 THEN
        '-- get the 32k block and put it directly into the EMS page frame
        FGetA PolysemyFILE, BYVAL EmsPg, BYVAL 0, ThirtyTwoK&
    ELSE
        '-- load the left over (<32k) bytes
        FGetA PolysemyFILE, BYVAL EmsPg, BYVAL 0, LeftOver&
    END IF

    IF DosError% GOTO DOSErrHandler

NEXT

FClose PolysemyFILE

File$ = DocDir$ + ".NDX"
DocumNum& = (FileSize&(File$) \ 8)
Limit = SQR(DocumNum& / 300)
IF Limit < 1 THEN Limit = 1

LoadedFlag = TRUE
EXIT SUB

DOSErrHandler:

    Chime 4
    CLS
    PRINT "Dos Error:"; WhichError%
    FreeHandles

EMSErrHandler:

    ' Chime 4
    CLS
    PRINT "EMS error:"; EmsError%
    FreeHandles
    STOP

UpdateStatusLine:

    NumFilesLoaded = NumFilesLoaded + 1
    IF Lang$ = "GERMAN" THEN
        QPrintRC " [ EINLADEN DER DATEN" + STR$(NumFilesLoaded) + " ] ", 24, 29, NormAttr
    ELSE
        QPrintRC " [ LOADING DATA FILES" + STR$(NumFilesLoaded) + " ] ", 24, 29, NormAttr
    END IF
RETURN 

END SUB 

FUNCTION Num$ (x) STATIC
    Num$ = LTRIM$(RTRIM$(STR$(x)))
END FUNCTION

SUB PrintAbstr (First&, Last&, FileNum, RecNum&) STATIC

'show author name
'FOpenAll DocDir$ + ".man", 0, 4, NameFile
'IF NameFile <> -1 THEN
'    DIM Author AS Str40
'    FGetRT NameFile, Author, CLNG(RecNum&), 40
'    QPrintRC "AUTHOR: " + RTRIM$(Author.Str), 1, 23, NormAttr
'END IF
'FClose NameFile

LinCount = Last& - First& + 1

IF LinCount < 1 THEN LinCount = 1

REDIM AbstrText(1 TO LinCount) AS AbstrType
REDIM AbstrClean(1 TO LinCount) AS AbstrType

LenAbst = LEN(AbstrText(1))
FOR i& = First& TO Last&

    GET FileNum, i&, AbstrText(j)

    Start = 1
    DO
        a = INSTR(Start, AbstrText(j).Str, CHR$(196))
        IF a THEN MID$(AbstrText(j).Str, a, 1) = "-": Start = a + 1
    LOOP UNTIL a = 0 OR Start > LEN(AbstrText(j).Str)

    AbstrClean(j).Str = AbstrText(j).Str

    CALL ChangeChar(AbstrClean(j).Str, New$, Keep$)

1 wvwwwwwwvwwwwww 
• blank line between sentences 

IF ASC(AbstrText<j).word(1)) » 100 THEN 

AbstrTextCj).word<1) = CHRSUSCCAbstrText(j).word(1)) - 100) 

If j » LinCount THEN 
LinCount = j 

REJIH PRESERVE AbstrTextCI TO LinCount) AS AbstrType 
^ ^ REOW PRESERVE AbstrCteanO TO LinCount) AS AbstrType 

Ab*trT«xtCj).Str = STR1K6$(80, 32) 

B© IF 



J 8 JM 

IF J > LinCount THEN 
LinCount = j 

REDIN PRESERVE AbstrTextCI TO Linteunt> AS AbstrTvoe 
END IF ^ " £SERVE AbStPClean(1 TO "oCount) AS AbstrType 

NEXT 

LinePtr = 1 'set line pointer

PrevLinePtr = 0
PrintAgain:
DO

    IF LinePtr <> PrevLinePtr THEN
        PrevLinePtr = LinePtr

1 Update the 2* lines of text 

        IF AscInKee <> UP AND AscInKee <> Dn THEN

            ' Print information bar at bottom
            IF TermTypeMode$ = "LOCAL" THEN
                LeftChar$ = CHR$(26)
                RightChar$ = CHR$(27)
                ExpChar$ = "F2"
                DirDocChar$ = "F5"
            ELSE
                LeftChar$ = CHR$(LeftArrowKey)
                RightChar$ = CHR$(RightArrowKey)
                ExpChar$ = CHR$(ShowExprKey)
                DirDocChar$ = CHR$(DirNumKey)
            END IF

If Long* ■ "GERMAN" THEN 

^ infolinrt = ExpCharS ♦ Suchlngaba- 4 LaftChart ♦ -: Na«cntes Dok - ♦ Right Chart ♦ - : Vorhergeh Ook - * Of 
^ f InfolineS . ExpCharS ♦ •: Expr • ♦ LeftCharS ♦ Next - - RightCnart ♦ Prew - ♦ DirOocChart ♦ Dee*. 
CALL References<Infoi1neS> 

END IF 

ULr • 7: )4x ■ 1: LRr » 23: LRc * 80 
IF TeraTypeHodeS - •LOCAL* THEM LRr » 24 

^ S e *l5f KUlr i. ULe /» Ur ' Uc ' Hon8 *" r ^ dear bottom portion 
JJ L j^^^ N ^ E |f r ' 2, Nonrtttr)' box around the bottce Dortion 

"QPrirtac'LHargS"* - " ;~ ~ ~ '~ — 

E JPrint« • HI6HUGHTS CACHTUN6: TEXT KAJOI IRREFUEKREN DA NICHT E! SINNZIISAHHEMHANG)-, 7 r 5, -1 * ' * "° nU 

□Print?: JbrgS ♦ ■ , 
^QPrintRC "HIS! LIGHTS (NOTE: DO NOT RELY ON THIS TEXT OUT OF CONTEXT) ", 7, 12. RevA^' ' ' 

END IF 

        IF TermTypeMode$ = "LOCAL" THEN
            R = 13
        ELSE
            R = 14
        END IF

        FOR i = 0 TO R

            ThisLine = i + LinePtr

            IF ThisLine <= LinCount THEN

                ' print line with normal attrib.
                QPrintRC AbstrText(ThisLine).Str, i + 8, 2, NormAttr

                ' maximum of three words per line to highlight
                FOR j = 1 TO 5
                    Word = ASC(AbstrText(ThisLine).Word(j))

                    IF Word > 0 THEN
                        ' Num() is number of words in the keyword to highlight
                        Lenth = ASC(AbstrText(ThisLine).Lenth(j))
                        QPrintRC MID$(AbstrText(ThisLine).Str, Word, Lenth), i + 8, Word + 1, RevAttr
                    ELSE
                        EXIT FOR
                    END IF

                NEXT

            END IF
        NEXT

    ' Wait for a key to be pressed

    QuitFlag = FALSE

    DO
        Inkee$ = INKEY$
    LOOP UNTIL LEN(Inkee$)

    IF LEN(Inkee$) = 1 THEN
        AscInKee = ASC(UCASE$(Inkee$))
    ELSEIF LEN(Inkee$) = 2 THEN
        AscInKee = ASC(RIGHT$(Inkee$, 1)) + 200
    END IF

SELECT CASE AscInKee

CASE UpArrowKey, UP
IF LinePtr > 1 THEN
LinePtr = LinePtr - 1
END IF

CASE DownArrowKey, Dn
IF LinePtr < LinCount - R - 1 THEN
LinePtr = LinePtr + 1
END IF

CASE PgUpKey, PgUp
IF LinePtr > 1 THEN
LinePtr = LinePtr - R
IF LinePtr < 1 THEN
LinePtr = 1
END IF
END IF

CASE PgDnKey, PgDn
IF LinePtr < LinCount - R - 1 THEN
LinePtr = LinePtr + R
IF LinePtr > LinCount THEN
LinePtr = LinCount
END IF
END IF

CASE HomeKey, HM
IF LinePtr > 1 THEN
LinePtr = 1
END IF

CASE EndKey, EN
IF LinePtr < LinCount - R THEN
LinePtr = LinCount - R

a5E "S^^"***' «, n. «. „. ^. ^ 

QuitFlag = TRUE

ExitFlag = AscInKee
END IF

CASE NewSearchKey, F10
  IF Lang$ = "GERMAN" THEN
    i$ = Question$("NEUE SUCHE ? [J/N] ", "JN", "ACHTUNG !")
    IF i$ = "J" THEN
      ExitFlag = AscInKee
      QuitFlag = TRUE
    ELSE
      GOTO PrintAgain
    END IF
  ELSE
    i$ = Question$("New Search? [Y/N] ", "YN", "WARNING!")
    IF i$ = "Y" THEN
      ExitFlag = AscInKee
      QuitFlag = TRUE
    ELSE
      GOTO PrintAgain
    END IF
  END IF

CASE ELSE
END SELECT

LOOP UNTIL QuitFlag
END SUB

FUNCTION Question$ (Prompt$, Choice$, Label$) STATIC
ULr = 12: LRr = ULr

S W-'prSpt^ 5 * 6 'f**~»»~f*r«* brack.:, C3 and space 

S; C ° U l d ' 2 i *** * ** * Pr «^Len - 1 

DULr = ULr - 2: DLRr = LRr + 2: DULc = ULc - 3: DLRc = LRc + 2

REDIM Scr%(1 TO ArraySize%(DULr, DULc, DLRr, DLRc))

MScrnSave DULr, DULc, DLRr, DLRc, Scr%(1)

UtndHgr BLr, ULc, LRr, L8e# 4. NomAttrX, NorrtttrX. " 

QPrintRC "[ ]", ULr, ULc + LEN(Prompt$) + 2, RevAttr%

Edit$ = LEFT$(Choice$, 1) ' default choice

DO
CALL Editor(Edit$, EdLen, Scan, 0, 1, NormAttr, RevAttr, ULr, ULc + LEN(Prompt$) + 3)

IF Scan = ESC AND Label$ <> "WARNING!" THEN
  Edit$ = ""
  EXIT DO
ELSEIF Scan = F10 THEN ' new search requested
  IF Lang$ = "GERMAN" THEN
    i$ = Question$("NEUE SUCHE ? [J/N] ", "JN", "ACHTUNG !")
    IF i$ = "J" THEN
      GlobalStatus = NewSearch
      EXIT DO
    END IF
  ELSE
    i$ = Question$("New Search? [Y/N] ", "YN", "WARNING!")
    IF i$ = "Y" THEN
      GlobalStatus = NewSearch
      EXIT DO
    END IF
  END IF
END IF



IF INSTR(Choice$, Edit$) = 0 THEN
  CALL Chime(2) ' not a valid choice
  Edit$ = LEFT$(Choice$, 1) ' default choice
END IF

' note: we can test for scan = -1 (function key 1) & pop up a help window
LOOP UNTIL INSTR(Choice$, Edit$) AND Scan = CR
Question$ = Edit$

MScrnRest DULr, DULc, DLRr, DLRc, Scr%(1)
ERASE Scr%

END FUNCTION

SUB ReadEnglishText (Txt$) STATIC
IF LEN(Txt$) = 0 THEN EXIT SUB

EndOfSentence$ = ".!?"

' replace all quotes with spaces so as not to complicate the
' lower casing of the first words of sentences
CALL ReplaceChar(Txt$, CHR$(34), " ")
CALL Sections(Txt$)

' process text, first making all initial letters
' of each sentence lower case
Start = 1
LENTxt = LEN(Txt$)

DO
  DO
    IF Start > LENTxt GOTO DoneLowerCase
    p = InstrTbl2(Start, Txt$, EndOfSentence$)
    IF p = 0 THEN ' no end-of-sentence punctuation was found, so exit
      GOTO DoneLowerCase
    ELSE ' check for a NLG (honorific/title)
      Start = p + 2
      FOR i = 1 TO UBOUND(NLG$)
        IF p > LENNLG(i) THEN
          IF LCASE$(MID$(Txt$, p - LENNLG(i), LENNLG(i))) = NLG$(i) THEN
            p = 0
            EXIT FOR
          END IF
        END IF
      NEXT
    END IF
  LOOP UNTIL p
  p = p + 2
  IF p > LENTxt GOTO DoneLowerCase
  CurrChar = MidChar%(Txt$, p)

  ' skip ahead to the next upper-case letter
  DO WHILE CurrChar < AscUpperA OR CurrChar > AscUpperZ
    p = p + 1
    IF p > LENTxt GOTO DoneLowerCase
    CurrChar = MidChar%(Txt$, p)
  LOOP

  MID$(Txt$, p, 1) = CHR$(CurrChar + 32)
  Start = p + 1
LOOP

DoneLowerCase:

'FirstLetter = ASC(LEFT$(LTRIM$(Txt$), 1))
'IF FirstLetter > 65 AND FirstLetter < 90 THEN
'  MID$(Txt$, 1, 1) = CHR$(FirstLetter + 32)
'END IF

New$ = " " ' replacement string for punctuation
OldS = ":/.-(X>- 

FOR j = 91 TO 96: Old$ = Old$ + CHR$(j): NEXT
LENOld = LEN(Old$)

CALL RemCtrl(Txt$, " ") ' replace all Ctrl chars with blanks

' replace only SOME punctuation with spaces
FOR j = 1 TO LENOld
  CALL ReplaceChar(Txt$, MID$(Old$, j, 1), New$)
NEXT

CALL StripRange(Txt$, 33, 37, TLen) ' strip punctuation ! to %
Txt$ = LEFT$(Txt$, TLen)

CALL StripRange(Txt$, 39, 47, TLen) ' strip ' to /
Txt$ = LEFT$(Txt$, TLen)

CALL StripRange(Txt$, 58, 64, TLen) ' strip : to @
Txt$ = LEFT$(Txt$, TLen)

CALL StripRange(Txt$, 123, 255, TLen) ' strip high chars
Txt$ = LEFT$(Txt$, TLen)

CALL Crunch(Txt$, " ", TLen) ' crunch all multiple spaces to 1
Txt$ = LEFT$(Txt$, TLen)

Txt$ = LTRIM$(RTRIM$(Txt$)) ' remove spaces from left & right

END SUB

SUB ReadGermanText (Txt$) STATIC
CALL Sections(Txt$)

CALL Lower(Txt$) ' convert all chars to lower case

CALL RemCtrl(Txt$, " ") ' replace all Ctrl chars with blanks

' replace only SOME punctuation with spaces
New$ = " "

OldS = •r/.-OCJO" 
FOR j = 1 TO LEN(Old$)
  CALL ReplaceChar(Txt$, MID$(Old$, j, 1), New$)
NEXT

CALL StripRange(Txt$, 33, 37, TLen) ' strip punctuation ! to %
Txt$ = LEFT$(Txt$, TLen)

' Note: the range is thru CHR$(96) because all the letters are lower case
' and all numbers are being stripped out too. We've skipped over 38
' because it's the & char which is allowed

CALL StripRange(Txt$, 39, 47, TLen)
Txt$ = LEFT$(Txt$, TLen)
CALL StripRange(Txt$, 53, 96, TLen)
Txt$ = LEFT$(Txt$, TLen)

CALL StripRange(Txt$, 123, 255, TLen) ' strip high chars
Txt$ = LEFT$(Txt$, TLen)

CALL Crunch(Txt$, " ", TLen) ' crunch all multiple spaces to 1
Txt$ = LEFT$(Txt$, TLen)

Txt$ = LTRIM$(RTRIM$(Txt$)) ' remove spaces from left & right

END SUB


SUB References (Text$) STATIC

' prints a centered status line on the bottom row (25th line) of the screen
Text$ = STRING$((80 - LEN(Text$)) \ 2, 32) + Text$ + STRING$((80 - LEN(Text$)) \ 2, 32)
IF LEN(Text$) < 80 THEN Text$ = Text$ + " "
IF TermTypeMode$ = "LOCAL" THEN
  R = 25
ELSE
  R = 24
END IF
IF Cnf.MonTyp = 2 THEN
  QPrintRC Text$, R, 1, RevAttr
ELSE
  QPrintRC Text$, R, 1, OneColor%(1, 7) 'RevAttr
END IF
END SUB

SUB ScrSR (SR$) STATIC

STATIC WindowOpen

IF SR$ = "S" THEN
  ' save the whole 25 x 80 screen
  REDIM Scr%(1 TO 2000)
  MScrnSave 1, 1, 25, 80, SEG Scr%(1)
  WindowOpen = TRUE
ELSE
  ' restore the screen only if one was saved
  IF WindowOpen THEN
    MScrnRest 1, 1, 25, 80, SEG Scr%(1)
    ERASE Scr%
    WindowOpen = FALSE
  END IF
END IF
END SUB

SUB Sections (Txt$)

' Look for "sections" or "articles"
IF Lang$ = "GERMAN" THEN
  SearchStr$ = "Par"
ELSE
  SearchStr$ = "Sec"
END IF
. LETTERS * "2S" 

FOR LookStep = 1 TO 2
Start = 1

DO

m = INSTR(Start, Txt$, SearchStr$)
IF m THEN

cS - "0123456789XVI - 

j = InstrTbl(m, Txt$, c$)

IF j THEN ' if this is not a last word
  Word$ = MID$(Txt$, m, j - m)
ELSE
  EXIT DO
END IF
NumFound = 3
IF LookStep = 1 THEN
  IF Lang$ = "GERMAN" THEN
    CALL FindExact(VARPTR(Paragraph$(1)), NumFound, Word$)
  ELSE
    CALL FindExact(VARPTR(Section$(1)), NumFound, Word$)
  END IF
ELSE
  IF Lang$ = "GERMAN" THEN
    CALL FindExact(VARPTR(Artikel$(1)), NumFound, Word$)
  ELSE
    CALL FindExact(VARPTR(Article$(1)), NumFound, Word$)
  END IF
END IF

IF Nun found <> -1 THEN 
k = j 

IF HidCharCTxtS, k) = 32 THEN k a k ♦ 1 
00 

ChS » «DS(TxtS. k, 1) 
k • k ♦ 1 

LOOP UNTIL CM «> " " 

al » 0 

.00 ■ - 

ChS HlOSOxtS, k ♦ Hi - 1, 1> 
ol = al ♦ 1 
LOOP UNTIL ChS < "0" OR CM » *** 
IF ml > 1 THEN 'there is a number 

Nufttt 3 KIDSCTxtS, k - 1, al - 1) 
IF VALCNubdS) <a 3000 THEN 

IF LoobStep a 2 AND VALCNuett) > 30 GOTO NeXTStep 
NewordS .a LETTERS 4 NuabS 
Oil = INSTRCk, TxtS, " ") 
IF o1 a 0 THEN 

TxtS a LEFTS(TxtS, o - 1) 4 NeuUordS 
ELSE 

TxtS a LEFTS (TxtS, n - U ♦ NewUonrt 4 RIGHTS (TxtS, L0KTxtS> - .1 
ENl 

' END IF 

ELSE 

IF SearchStrS a -Art" THEN 

o1 • 0 

DO 

_ ChS = HIDSCTxtSrk 4 Bl - 1, 1; •- 

>1 ° bU 1 

LOOP UNTIL ChS a — OR CM * " " OR Dl > LEN(TxtS) 
NuobS = MIDSCTxtS, k - 1, Ol - 1) 
NusFound a 30 
CALL FiPCcx3Ct(VARPTR<NusbersJ(1)), NuaFound, NumbS) 
IF Nuafound <> -1 THEN 

NeuUordS a "m° * LTRIH$<STRS(NuaFowjd 4 1)> 

■T a INSTR(k, TxtS, ■ "> 

IF ttYt = 0 THEN 

TxtS a LEFTSCTxtS, B - 1) ♦ NauUordS 

ELSE 

TxtS a LEPTSttxtS, o - 13 4 NeuUordS 4 RIGHTS (TxtS, LEB(TxtS) - o1 4 1) 
END IF 
END IF 

END IF 
END IF 

END IF 
END IF 

NextStep: ■ • 

Start • « ♦ 1 
LOOP UNTIL n • 0 
SearcMtrS a "Art" 
LETTERS a »za» 
NEXT 
DO
  m = INSTR(Start, Txt$, CHR$(21))
  IF m THEN
    k = m + 1
    ' skip blanks after the section sign
    DO
      Ch$ = MID$(Txt$, k, 1)
      k = k + 1
    LOOP UNTIL Ch$ <> " "
    m1 = 0
    ' collect the digits that follow
    DO
      Ch$ = MID$(Txt$, k + m1 - 1, 1)
      m1 = m1 + 1
    LOOP UNTIL Ch$ < "0" OR Ch$ > "9"
    Numb$ = MID$(Txt$, k - 1, m1 - 1)
    IF VAL(Numb$) < 3000 THEN
      NewWord$ = "zs" + Numb$
      m1 = INSTR(k, Txt$, " ")
      IF m1 = 0 THEN
        Txt$ = LEFT$(Txt$, m - 1) + NewWord$
      ELSE
        Txt$ = LEFT$(Txt$, m - 1) + NewWord$ + RIGHT$(Txt$, LEN(Txt$) - m1 + 1)
      END IF
    END IF
  END IF
LOOP UNTIL m = 0

END SUB

SUB SelectMenu (Ch$, Expression AS ExpressionType, HotStr$, GlobalStatus) STATIC
SELECT CASE Ch$

CASE MID$(HotStr$, 1, 1) ' other suggested words: SWAPS/relatives
  OtherWords Expression, WordPtr

CASE MID$(HotStr$, 3, 1) ' Delete a Word
  DeleteWord Expression

CASE MID$(HotStr$, 4, 1) ' Insert Last Deleted Word
  InsertWord Expression

CASE MID$(HotStr$, 2, 1) ' Add New Words
  AddSearchTerms Expression

CASE MID$(HotStr$, 6, 1) ' New Search
  IF Lang$ = "GERMAN" THEN
    i$ = Question$("NEUE SUCHE ? [J/N] ", "JN", "ACHTUNG !")
    IF i$ = "J" THEN
      GlobalStatus = NewSearch
    ELSE
      GlobalStatus = 3
    END IF
  ELSE
    i$ = Question$("New Search? [Y/N] ", "YN", "WARNING!")
    IF i$ = "Y" THEN
      GlobalStatus = NewSearch
    ELSE
      GlobalStatus = 3
    END IF
  END IF

CASE MID$(HotStr$, 5, 1) ' view records
  GlobalStatus = EditSearch
  REDIM MatchRecords(1 TO 1) AS RecInfoType
  RankRecords MatchRecords(), NumFound, Expression
  IF NumFound > 0 THEN ShowDoc MatchRecords(), NumFound, Expression, RecNum

CASE MID$(HotStr$, 7, 1) ' Exit program
ScrSR -9" 
GlobalStatus 3 ex 
CASE ELSE 
END SELECT 

END SUB 

SUB ShowAbstr (RecNum&) STATIC
DIM FileNdx AS ISAMtype
FilNum = FREEFILE

-^■S" SZE&TgSLP* m« 

CLOSE FilNum
FilNum = FREEFILE

OPEN AbstrDir$ + ".TXT" FOR RANDOM ACCESS READ SHARED AS FilNum LEN = 88
CLOSE FiSiut W «^ , '< M ««- M '*. FileNdx.tast, FilNua, Racing) 

END SUB 

SUB ShowDoc (Show() AS RecInfoType, NumShow, Expr AS ExpressionType, RecNum) STATIC

REDIM ShowHeaders(1 TO NumShow) AS HeaderType
DIM KeyNdxTemp AS KeyNdxType127

PrevGlobalStatus = GlobalStatus
Curr = 1

LastFlag = FALSE

GOSUB File

DO

SELECT CASE ExitFlag

CASE F2, ShowExprKey ' show the current expression
  REDIM Scr%(1 TO 2000)
  CALL MScrnSave(1, 1, 25, 80, SEG Scr%(1))
  CALL ShowExpr(Expr)
  CALL ShowQuery
  CALL MScrnRest(1, 1, 25, 80, SEG Scr%(1))
  ERASE Scr%
  GOSUB File

CASE F4, 275, LeftArrowKey ' Previous Document
  IF Curr > 1 THEN Curr = Curr - 1
  GOSUB File

CASE F3, 277, RightArrowKey ' Next Document
  IF Curr < NumShow THEN Curr = Curr + 1
  GOSUB File

CASE FS. OirHusK*y 'Direct type <r-ruw>er 

Chios 5 

SaveCurr = curr 
30 

CL5 

InfoLineS = "EscrfTevious Screen Enter: To Go Directly to Entered Document Huabar* 
CALL References (InfoLineS) 
IF Cnf .RmTyp = 2 THEN 
* Col » RevAttr 

ELSE 

Col 3 OneCoLorlK, A) 

END IF 

CleorScrO 2, 20, 4, 65, Col 

CAU 0rawfiox(2, 20, 4, 65, 2, Col) 

OPrintRC " ESTER DOCUMBIT NUMBER J1 -" ♦ SmtUBCOOlShOw)) ♦ "1 3, 26, OneColor{U, 
CurS 

CurLen * 4 

CALL Editor (CurS, CurLen, Scan, 1, 0, RevAttr, RevAttr, 3, 60) 
If Scan * 13 THEN 

Curr = VAL(CurS) 

If Curr > UBOUKO(Show) OB Curr « 0 THEN 
Chine 6 

ClearScrO 5, 25, 8, 61, Col 

CALL OravBoxXS, 25, 8, 61, 2, Col) 

OPrintRC "Kuaber ie too big or not -ralid.", 6, 29, Col 

QPrintRC "Press space bar to contirve.", 7, 30, Col 

Ua it Space 

OPrintRC ■ «\ 3, 60, -1 
COLOR Fg, EG 

ELSE 

COLOR fg, BG 
GOSUB File 

END IF 

ELSE 

IF Scan * 27 THEfl 

Curr = SaveCurr 
G0SU3 File 
ELSE 

OPrintRC - 3, 60, -1 
GOTO TooBig 

END If 

END IF 

-OOP UNTIL (Curr <° UBOUKDCShow) AND Curr > 0) 

CASE F10, NewSearchKey ' New Search

GlobalStatus = NewSearch
HistFlag = FALSE
EXIT SUB

CASE CR

DIM FileNdx AS ISAMtype

PrevGlobalStatus = GlobalStatus

REDIM Scr2%(1 TO 2000)
CALL MScrnSave(1, 1, 6, 80, SEG Scr2%(1))

CLS

RecNum& = Show(Curr).Rec
FilNum = FREEFILE

' look up the first and last text records of this document
OPEN DocDir$ + ".NDX" FOR RANDOM ACCESS READ SHARED AS FilNum LEN = 8
GET FilNum, RecNum&, FileNdx
CLOSE FilNum

FilNum = FREEFILE

OPEN DocDir$ + ".TXT" FOR RANDOM ACCESS READ SHARED AS FilNum LEN = 80
CALL FullText(FileNdx.First, FileNdx.Last, FilNum, ExprKeys$)
CLOSE FilNum

SELECT CASE ExitFlag

CASE ESC
  CLS
  CALL MScrnRest(1, 1, 6, 80, SEG Scr2%(1))
  CALL ShowAbstr(RecNum&)
  IF ExitFlag = ESC THEN HistFlag = FALSE: GOSUB File

CASE 277, F3, RightArrowKey
  IF Curr < NumShow THEN Curr = Curr + 1
  GOSUB File

CASE 275, F4, LeftArrowKey
  IF Curr > 1 THEN Curr = Curr - 1
  GOSUB File

END SELECT
ERASE Scr2%

END SELECT
LOOP UNTIL ExitFlag = ESC
EXIT SUB
File:

CurrRec& = Show(Curr).Rec
CurrRecDebug = Curr '***rs debug
CALL ShowKeywords(CurrRec&)

QPrintRC "[" + LTRIM$(STR$(Curr)) + "]", 1, 75, RevAttr
FilNum = FREEFILE

OPEN DocDir$ + ".NDX" FOR RANDOM ACCESS READ SHARED AS FilNum LEN = 8
GET FilNum, CurrRec&, FileNdx
CLOSE FilNum

NumPg = (FileNdx.Last - FileNdx.First) / 60
IF NumPg = 1 THEN
  QPrintRC "Doc" + STR$(CurrRec&) + " " + " 1 page", 1, 3, NormAttr
ELSE
  QPrintRC "Doc" + STR$(CurrRec&) + " " + STR$(NumPg) + " pages", 1, 3, NormAttr
END IF

QPrintRC "[Doc" + STR$(CurrRec&) + "]", 1, 3, RevAttr

IF Curr = 1 THEN CALL ShowHist(NumShow, Curr, Expr)
IF Curr *= NuaShow THEN 

IF (HmFlag *> CR) AND Curr <> 1 THEN 

CALL ShowHistCNuaShoy, Curr, Expr) 

ELSE 

IF Curr > 2 THEN 

IF (HatchRecValsC Curr). Value < .6 * HatchRecVaUtCurr - 1). Value OR HatchRecValsC Curr). Value < 

END IF 

END IF ... 

END IF 

•CurrRecft = CurrRecft HOD 600 

DO WHILE (Hist? lag <> ESC AND HistFlag o CR) AND (Curr » 1) 
SELECT CASE HistFlag
CASE 275, F4, LeftArrowKey ' previous

IF Curr > 1 THEN Curr = Curr - 1
CurrRec& = Show(Curr).Rec
CurrRecDebug = Curr '***rs debug
CALL ShowKeywords(CurrRec&)

QPrintRC "[" + LTRIM$(STR$(Curr)) + "]", 1, 75, RevAttr

FilNum = FREEFILE

OPEN DocDir$ + ".NDX" FOR RANDOM ACCESS READ SHARED AS FilNum LEN = 8
GET FilNum, CurrRec&, FileNdx
CLOSE FilNum

NumPg = (FileNdx.Last - FileNdx.First) / 60
IF NumPg = 1 THEN
  QPrintRC "Doc" + STR$(CurrRec&) + " " + " 1 page", 1, 3, NormAttr
ELSE
  QPrintRC "Doc" + STR$(CurrRec&) + " " + STR$(NumPg) + " pages", 1, 3, NormAttr
END IF

QPrintRC "[Doc" + STR$(CurrRec&) + "]", 1, 3, RevAttr

IF Curr > 24 OR FullFlag THEN
  IF FullFlag THEN
    CALL ShowHist(NumShow, Curr, Expr)
  ELSE
    CALL ScrollHist(Curr, "R", NumShow)
  END IF
ELSE
  CALL RewriteHist(Curr)
END IF

CASE 277, F3, RightArrowKey ' next

IF Curr < NumShow THEN Curr = Curr + 1
CurrRec& = Show(Curr).Rec
CurrRecDebug = Curr '***rs debug
IF NOT LastFlag THEN CALL ShowKeywords(CurrRec&)
QPrintRC "[" + LTRIM$(STR$(Curr)) + "]", 1, 73, RevAttr
QPrintRC "[Doc" + STR$(CurrRec&) + "]", 1, 3, RevAttr
FilNum = FREEFILE

OPEN DocDir$ + ".NDX" FOR RANDOM ACCESS READ SHARED AS FilNum LEN = 8
GET FilNum, CurrRec&, FileNdx
CLOSE FilNum

NumPg = (FileNdx.Last - FileNdx.First) / 60

IF NumPg = 1 THEN
  QPrintRC "Doc" + STR$(CurrRec&) + " " + " 1 page", 1, 3, NormAttr
ELSE
  QPrintRC "Doc" + STR$(CurrRec&) + " " + STR$(NumPg) + " pages", 1, 3, NormAttr
END IF

IF Curr > 25 OR FullFlag THEN
  IF FullFlag THEN
    CALL ShowHist(NumShow, Curr, Expr)
  ELSE
    CALL ScrollHist(Curr, "L", NumShow)
  END IF
ELSE
  CALL RewriteHist(Curr)
END IF

CASE HomeKey

IF Curr <> 1 THEN Curr = 1

CurrRec& = Show(Curr).Rec
CurrRecDebug = Curr '***rs debug
CALL ShowKeywords(CurrRec&)

QPrintRC "[" + LTRIM$(STR$(Curr)) + "]", 1, 75, RevAttr
QPrintRC "[Doc" + STR$(CurrRec&) + "]", 1, 3, RevAttr
CALL ShowHist(NumShow, Curr, Expr)

CASE EN, EndKey

IF Curr <> NumShow THEN Curr = NumShow
CurrRec& = Show(Curr).Rec
CurrRecDebug = Curr '***rs debug
CALL ShowKeywords(CurrRec&)

QPrintRC "[" + LTRIM$(STR$(Curr)) + "]", 1, 75, RevAttr
QPrintRC "[Doc" + STR$(CurrRec&) + "]", 1, 3, RevAttr
CALL ShowHist(NumShow, Curr, Expr)
LastFlag = TRUE

CASE ELSE

EXIT DO
END SELECT

LOOP

IF HistFlag = ESC THEN EXIT SUB
ExitFlag = HistFlag

' IF GlobalStatus <> NewSearch AND ExitFlag = CR THEN CALL ShowAbstr(CurrRec&)
IF ExitFlag = ESC THEN HistFlag = FALSE: GOTO File
RETURN

END SUB

SUB ShowExpr (Expr AS ExpressionType) STATIC

CALL ClearBG

TotSubExpr = 0
IF Expr.Num THEN

FOR i = 1 TO Expr.Num
  TotSubExpr = TotSubExpr + Expr.SubExpr(i).Num
NEXT

ELSE ' no search expression has been entered, so display a note to the user
Note$ = "No Search Expression Yet"

ULr « 2: ULc - 2: From ■ 1: LRr ■ ULr 3: LRc • ULc ♦ LENCNoteS) ♦ 4 
CALL VindNgrCULr - 1. UU ♦ 1, LRr ♦ 1, LRc ♦ 1, Frame, NormAttr, RevAttr, -> 
CAU OPrintRCCNctel, ULr * 2, ULc 4 2, -1) 
EXIT SUB 

END IF 

IF TotSubExpr > 17 THEN ' can't display more than 17 on the screen
IF Lang$ = "GERMAN" THEN
  DispMsg "Zuviele Worte! System verarbeitet maximal 20 Suchbegriffe", 0, 0
ELSE
  DispMsg "Cannot process: You have too many words in your search expression.", 0, 0
END IF

CALL WaitSpace
CALL ChiaeO) 
EXIT SUB 

END IF 

Old = Expr.SubExpr(1).Num

REDIM p$(1 TO Expr.Num, 1 TO Old)

MaxWordLen = 34: LRr = 2

' find maximum word length

FOR S = 1 TO Expr.Num
  IF Expr.SubExpr(S).Num > Old THEN REDIM PRESERVE p$(1 TO Expr.Num, 1 TO Expr.SubExpr(S).Num)
  FOR i = 1 TO Expr.SubExpr(S).Num
    LRr = LRr + 1
    c = Str2Code%(Expr.SubExpr(S).Phrase, i)
    p$(S, i) = Dict$(c)
    IF LEN(p$(S, i)) > MaxWordLen THEN MaxWordLen = LEN(p$(S, i))
  NEXT
NEXT

' draw the Expression box only if FindExpr was called
' (i.e. searched on the whole expression)

IF Expr. hatch » 0 THEN " 1t wasn't -1 so FindExpr was called 

ULr » 2: ULc a 2: Frane = 219 
LRc • ULc ♦ haxUordLen ♦ 2 

END IF 



ULr = 3
ULc = 3
Frame = 1

LRc = ULc + MaxWordLen ' -1
n$ = "#Docs"

CALL WindMgr(ULr + 1, ULc + 1, LRr + 1, LRc + 1, Frame, NormAttr%, NormAttr%, n$)
FOR S = 1 TO Expr.Num
  FOR i = 1 TO Expr.SubExpr(S).Num
    IF p$(S, i) > "" THEN QPrintRC p$(S, i), ULr + i, ULc + 1, -1
  NEXT
NEXT
END SUB

SUB ShowXeywords (RecNwaS) STATIC 

1 4 rows" of 80 colwsis. including window 

:ULr » 1: ULc » 1: LRr - 6: LRc = 60: LenCol ° LRr - ULr - V nan 
•CLS 

CALL ClearScr0(ULr, ULc, LRr, LRc, RevAttr) ' clear bottom portion
CALL DrawBox(ULr, ULc, LRr, LRc, 2, RevAttr)

' now display keywords

DIM KeyNdxTemp AS KeyNdxType127

' get 63-word list from "newkey.ndx"
FGetRT KeyNdxFile, KeyNdxTemp, RecNum&, LEN(KeyNdxTemp)
IF Lang$ = "GERMAN" THEN
CALL QPrintRC(LMarg$ + " WICHTIGSTE BEGRIFFE: " + RMarg$, ULr, 28, -1)
W °S - * «Y words: - ♦ MargS, ULr. a. -«

KeepExprS = 
1 " debug 

. 32767 0* K^T«p.Ku, «', j^k LOCATE ,8. „ BEEP: M 
•nark with words that natch the Query' 

Jf k S To n ^ C£xprKeys5 ' ^**W-P.str. i)) 

Keywords Ci). Flag » •»» 
ELSE 3 

Keywords CD. Flag = ■ - 

END IF 

'get word and cut it up to 22 letters ♦ • ■ 
• = 0ictS(Str2Coder«CeyNdxTeap Str tJ) ** 
x$ = RIGHTS (xS, LEN(x$) - 5) 
CALL Cui'Jord(xS) 
Keywords (D.Uord = xS 
^ " KeyWords<1).code* Str2CocWClCeyfto.Teap.Str, t) 

IF KeyNdxTeiEp.Nuo > 12 THEN 

'sort rest of kw by flag,1.e. push ntched words up 
SortT KeyVords<13), KeyNdxTeap.Nu, - 12. 1, lENOCeyUordsCI)), 0, 1 
look for how nny word we should replace to f1rst-12 part 

DO 11 uo " 1 * ar ° " a,ch « 1 * if t *«» deletlr* last noMMehed word, before the. 

n n o 

Sel?J!S U il^r2jl , J,"lJS d f." ^ """^ «■»«•«» totto. 

VXSu *5f? m ' °' «""''""*«»• °- ' 

i a i ♦ 1 

NEXT 

Key«ords(12 - Found) * Keyllords<12 - Shift) 'shift kw 
KeyuordsC12 - Shift). Flag « ■ - 
EXIT FOR 

ELSE 

EHOIF n = n * 1 «ny words should be shifted 

NEXT 

END IF 

LOOP UNTIL Shift » -1 OR n >= 12 - Found 

^chec< could we shift everything. If not, shift it and decrease Found 

!^Tth^ 12) ' ' °' ^CKoyUoroatl),. 0, 1 

Found = Found - 1 

•look for the first word which can be shifted 

i^nSff^^ " F1rstSpsce ' °- °- 1 

FirrtSpace - FirstSpace ♦ 1 'real spaee 

'SSLl J!/ SMft 'If there t. -here to shift it, do it 
fcyiterdsCFirstSpace) = Keywords C1Z - Shift) 
aD ^ KeyUords<12 - ShifO.Flag * - - 

END IF 

©10 IF 
LOOP 1*TIL Shift » -1 
f Oft i » 1 TO Found 

HEXT - » - key«ords(12 ♦ Found . i ♦ ivrepvate oatched keywords beyond 12 to 12 

END IF 

CurrCoL » ULc - * 

n-3-U80UX0(Keyttorss)— — . ._• 

IF KeySdxTeap.V^i < 12 THEN 

REDO* PRESERVE Keywords (1 TO 12) AS UcrdShovType 
FOR i » KeyNdxTesp.Nua ♦ 1 TO 12 
KeyVordsCi).Word 

NEXT 

END IF 

FOR i o 1 T0*4 



***rs debug wvwwvwv 

get Salton value for this word ***ra 
Code a KeyuordsCD.Code 
if Code > 0 then 

' M .£^J^ C ^ (C0 ^- lnde * 10 ™^<Code).lndex ♦ KYInch (Code) Nun - 1 
FGetRT KYImertDatFILE, KYInfo, Index*, KVInfoLEB 
IF KYInfo.Rec a RecNUcft THEN 
SaltonValue « KYInfo. Value 
EXH FOR 
END If 
NEXT Index& 

^OFHntRC LEFT,(OictS<Code), 15) ♦ - S:« ♦ STO(SaltonValue), ULr ♦ i, CurrCol ♦ 2, -V*"rs debug 

***rs debug 

IF Keywords Ci). Flag <> ■ ■ THEN 
• CPrintRC KeyWords(1>.Uord. ULr ♦ i, CurrCol ♦ 2, Nc-Attr 



END IF 



QPrintRC KeyWords(i).Word, ULr + i, CurrCol + 2, -1
' ***rs debug
' get Salton Value for this word ***rs
1 Code = KeyVordsd * 4>.Code 
» if Code > 0 then 

• FOR Index* » KYIndx(Code). Index TO KYIndxC Code). Index ♦ KYIndxC Code). Hub - 1 

• FGetRT KYInvertDatFILE, KYlnfo, Indexft, KYInfoLEN 

• IF ICY Info. Rec = aecmo« THEM 

Sal ton value = KYlnfo. value 
EXIT FOR 
END IF 
NEXT Index* 

• QPrintRC LEFTS<0ict$(Code), 1S> + • S:" ♦ STRS(SaltsnValue), ULr ♦ i. CurrCol ♦ 25 ♦ 2, -V***r* debug 
end if 

' ***ra debug 

IF KeyWordsfi ♦ 4). Flag «> ■ " THEN 

C?rintRC KeyUords (i ♦ 4) .(lord, ULr ♦ f, CurrCol ♦ 27, NormAt-r 

ELSE 

QPrintRC KeyWordsCi ♦ O.Uord, ULr ♦ i, CurrCol ♦ 27, -1 

END IF 

■ ***rs debug vwvwvwv 

• ^ salton Value for this word ***rs 

• Code = KeyUordsCi ♦ 3). Code 

• if Code » 0 then 

• FOR Indexft = KYIndx(Cod»).Index TO KYInd*<Code>. Index ♦ KYIndx(C©de).Nua - 1 
' FGetRT KYInvertDatFILE, KYlnfo, Index*, KYInfoLEN 

» IF KYlnfo. Ree = Recffua8 THEN 

SaltonVaLue = KYInfo.Value 
EXIT FOR 
END IF 

• NEXT IndexS 

QPrintRC LEFTS C31 CIS (Code), 15) ♦ ■ S:" ♦ STRSCSaltonValue), ULr ♦ 1, CurrCol ♦ 50 ♦ 2, -r***rs debug 
end if 

• ***rs debug *"* 

IF KeyUsrasCi 8). Flag <> • • THEN 

SPrintRC KeyWordsCi ♦ 8).Word. ULr ♦ i, CurrCol ♦ 52. NonaAttr 

ELSE 

. QPrintRC KeyWordsCi «■ 8). Word, ULr * i, CurrCol ♦ 52. -1 
END IF 

NEXT 

— ***rs debug wwmvw 

• QPrintRC "Document ^aru:" * STRS<RecsDebuo£<CurrfiecDebug)), 6, 1, -1'***ra debug 

• Currftov = CurrRo- * 1 

' IF Cur -"tv = ,J*r THEN 

1 IF ..rrCoi = OldCurrCol OR CurrCol » OldCurrCol + 25 THEN cove eve- to next eoluon 

• CurrRo- = OldCurrRow 

• CurrCcv = CurrCol + 25 

• ELSE -e're already at the bottoo of the 3nd eoluen so we're done 

• EXIT ?3R . 

• END IF 
END IF 

• ***rs debug ~ 

END SUB 

SUB ShowQuery

REDIM Scr%(1 TO 800)

MScrnSave 20, 1, 25, 80, Scr%(1)

ClearScr0 20, 2, 24, 79, NormAttr

CALL DrawBox(20, 2, 24, 79, 1, NormAttr)

FOR i = 1 TO 3
  IF LEN(Sentence$(i)) THEN QPrintRC Sentence$(i), 20 + i, 3, -1
NEXT

CALL References(" PRESS ANY KEY TO CONTINUE ")
DO
  Ch$ = INKEY$
LOOP UNTIL LEN(Ch$) > 0
MScrnRest 20, 1, 25, 80, Scr%(1)
ERASE Scr%

END SUB

FUNCTION SpaceNum$ (x, Space%) STATIC

' right-justify a number in a field of Space% characters
SpaceNum$ = RIGHT$(STRING$(Space%, " ") + Num$(x), Space%)

END FUNCTION

FUNCTION Str2Code% (Store$, Location%) STATIC

' each code occupies two bytes of the packed string
Str2Code% = CVI(MID$(Store$, Location% * 2 - 1, 2))

END FUNCTION

SUB WaitSpace STATIC

CALL ClearBuf

DO
  Ky$ = INKEY$
  IF Ky$ = " " THEN
    EXIT SUB
  ELSEIF Ky$ = CHR$(0) + CHR$(F10 - 200) OR Ky$ = CHR$(NewSearchKey) THEN
    IF Lang$ = "GERMAN" THEN
      i$ = Question$("NEUE SUCHE ? [J/N] ", "JN", "ACHTUNG !")
      IF i$ = "J" THEN
        GlobalStatus = NewSearch
        EXIT SUB
      END IF
    ELSE
      i$ = Question$("New Search? [Y/N] ", "YN", "WARNING!")
      IF i$ = "Y" THEN
        GlobalStatus = NewSearch
        EXIT SUB
      END IF
    END IF
  END IF
LOOP
END SUB

SUB WindMgr (ULRow, ULCol, LRRow, LRCol, Frame, BoxColr, TextColr, Text$) STATIC
CALL ClearScr0(ULRow, ULCol, LRRow, LRCol, BoxColr)

CALL DrawBox(ULRow - 1, ULCol - 1, LRRow + 1, LRCol + 1, Frame, BoxColr)
IF Text$ = "#DOCS" THEN
  CALL QPrintRC(Text$, ULRow - 1, ULCol, TextColr)
ELSE
  IF LEN(Text$) THEN CALL QPrintRC("[" + Text$ + "]", ULRow - 1, ULCol + 1, TextColr)
END IF

END SUB

SUB WordParse (Txt$, WordList$(), NumWords) STATIC
NumWords = 0

TotW = InCount%(Txt$, " ") + 1 ' number of words in current line
FOR Word = 1 TO TotW

  CALL Extract(Txt$, " ", Word, Start, SLen)

  IF SLen > 0 THEN

    ' extract word
    w$ = MID$(Txt$, Start, SLen)

    ' fill out 1 and 2 char words with /'s
    IF SLen < 3 THEN w$ = w$ + STRING$(3 - SLen, "/")

    ' allow only words that start with alphabetic chars
    ASCw = ASC(w$)

    IF (ASCw >= ASCA AND ASCw <= ASCZ) OR (ASCw >= AscUpperA AND ASCw <= AscUpperZ) THEN

      NumWords = NumWords + 1

      ' the following doesn't apply to GERMAN
      IF Lang$ = "ENGLISH" THEN
        IF RIGHT$(w$, 2) = "'s" THEN ' remove the 's
          w$ = LEFT$(w$, SLen - 2)
        ELSEIF RIGHT$(w$, 1) = "'" THEN ' and any final '
          w$ = LEFT$(w$, SLen - 1)
        END IF
      END IF

      ' store the word
      REDIM PRESERVE WordList$(1 TO NumWords)
      WordList$(NumWords) = w$

    END IF

  END IF

NEXT ' word in line

END SUB

FUNCTION ZeroNum$ (x, Zero) STATIC

' fill a number with leading zeros
ZeroNum$ = RIGHT$(STRING$(Zero, "0") + Num$(x), Zero)
END FUNCTION

DECLARE SUB SortSwapEMS (Handle%, NumEls%)

DECLARE SUB DrawBox (ULRow%, ULCol%, LRRow%, LRCol%, Frame%, Col%)
DEFINT A-Z

TYPE WordType
  Word AS STRING * 64
  LineNumb AS INTEGER
END TYPE
DEFINT A-Z

CONST FALSE = 0, TRUE = NOT FALSE, ASCEND = 0, Descend = 1
CONST MaxShow = 50

' scan code + 200 so as not to mix with letters
CONST UP = 272, PGUP = 273, Dn = 280, PGDN = 281, HM = 271, EN = 279
CONST CtrlPgUp = 332, CtrlPgDn = 318, CtrlHM = 319, CtrlEN = 317
CONST F1 = 259, F2 = 260, F3 = 261, F4 = 262, F5 = 263
CONST F6 = 264, F7 = 265, F8 = 266, F9 = 267, F10 = 268

CONST ESC = 27, CR = 13

CONST NewSearch = 1, AddWords = 2, EditSearch = 3, Back = 4, Forward = 5, SWAPS = 6
'$INCLUDE: '\\vadim\c-drive\user\include\defcnf.bi'
'$INCLUDE: '\\vadim\c-drive\user\include\types.bi'
'$INCLUDE: '\\vadim\c-drive\user\INCLUDE\Shared.bi'
'$INCLUDE: '\\vadim\c-drive\user\EXTERN.BAS'

DECLARE SUB Code2Str (Store$, Location%, Code%)

DECLARE SUB CPrint (x$)

DECLARE SUB DispMsg (Msg$, R%, c%)

DECLARE SUB EmsAlloc (NumPages%, Handle%, LoadFile$)

KC^ M SlLtKiSr?)^ * EXPrC " i0nTyP€ ' Rel ° AS CollectT >5 *, NuaJtoU, ExeludeS) 

DECLARE SUB OtherWords (Expr AS ExpressionType, WordPtr)

DECLARE SUB PickCheice l*<) AS Collect Type, HuaHX, labels, PickXO. NuaPlckX ExaetFLao Exnr AS i^Tu~% 

ssj s jSKiffaa ^^^-^r^ «• - «■ 

DECLARE SUB References (Text$)

DECLARE SUB ScrSR (SR$)

DECLARE SUB ShowExpr (Expr AS ANY)

DECLARE SUB SortEMSRankInfo (Handle%, Num%)

SeclH *m IlSSLfr^^^ Expr AS ExpresaionType, a <) AS CollectType, ftnK. Repeate, Exact Flag) 

SecL^S S^ SaU^O Ch ° 1Ce2 ' " WUfl *' B ° XBOt3C ' « Confis. WC^lc, PgUK, PgDK, UK, OX, Ter*Type*ode 

DECLARE SUB WindMgr (ULRow%, ULCol%, LRRow%, LRCol%, Frame%, BoxColr%, TextColr%, Text$)
DECLARE FUNCTION BoxInputS CEditS, Titles, FratS, to*. ColX Scan) ' 
DECLARE FUNCTION CoaboSueft (Bits, Value*. KodaS, PoiyJC)) 
DECLARE FUNCTION Dict$ (Code%)

DECLARE FUNCTION FirstLast% (Word$, First%, Last%, KeyType%)
DECLARE FUNCTION KeyInstr% (KeyStr$, Srch$)
DECLARE FUNCTION KeyMid$ (KeyStr$, Start)
DECLARE FUNCTION Num$ (x%)

DECLARE FUNCTION Question$ (Prompt$, Choice$, Label$)

S££!I ™S™ S^Cx^PaSJ " E * Pr " ti ° nTyPe ' * CollectType, NuaTopXeysX) 

DECLARE FUNCTION Str2Code% (s$, k%)

'$INCLUDE: '\\vadim\c-drive\user\include\prefiles.bi'

SUB AddSearchTerms (Expr AS ExpressionType) STATIC

Repeate = FALSE
CurrentSub = 1
GlobalStatus = EditSearch

IF Expr.SubExpr(1).Num = 15 THEN ' can't add any more words
Chime 4

IF LangS » "GERKAN" THEN _ 

^ DIspNsg -FehUr: Haxfool 15 Suchbegriffe. LfcWASTE ua ueiterzuaatheni", R, e * 

EHO XF -BW0R: Uflrit °* 15 S ""* Terns. Press the Space Bar to continue:*, R, e 

' UaitSpace 
Oispftsg 0, 0 
EXIT SUB 

END IF 

IF LangS «> "GERMAN" THEN * 

END IF RcferCncea( " E ' WTER ^ « **> THEN PRESS ENTER OR PRESS ESC TO CANCEL") 

' Get search tera(s) froo user 

GetSearchTerm:

ScrSR "S"

Title$ = "Adding Search Words"
Row = 19

TeraS ^InoutSCSPACESCiO). Titles, -Search Uord/Phrase:". Row. 8, Scan) 

IF (ABS(Scan) + 200 = F10 AND Scan < 0) OR Scan = NewSearchKey THEN
  IF Lang$ = "GERMAN" THEN
    i$ = Question$("NEUE SUCHE ? [J/N] ", "JN", "ACHTUNG !")
    IF i$ = "J" THEN
      GlobalStatus = NewSearch
      EXIT SUB
    END IF
  ELSE
    i$ = Question$("New Search? [Y/N] ", "YN", "WARNING!")
    IF i$ = "Y" THEN
      GlobalStatus = NewSearch
      EXIT SUB
    END IF
  END IF
END IF

IF Term$ = "" THEN EXIT SUB

' allocate dummy space for the matching codes

REDIM MatchCodes(1 TO 1) AS CollectType

' collect the matching codes into the array

CALL TermMatch(Term$, Expr, MatchCodes(), NumMatchCodes, Repeate, ExactFlag)
IF NumMatchCodes = 0 THEN ' no matching entries in dictionary
IF LangS a -GERHAN- THEN 

E» „ " , ' 0rt " + T — * "• " «™» *• 5*. aar t. . «w Twa ... R , t 
ELSE

IF LangS = "GERMAN" THEN 

END IP SiK '"* 9 " Phr "" *" «"»•" - * T '« ♦ - « « Pr^ th. 5P3C. ^ to «.r . Sm , 

END IF 
Ua It Space 
Dlsptlsg o; 0 
ShowExpr Expr . 

SOTO GetSearehTern' gat a new word to search for 



END IF 

IF NuiaiatdiCodes > 1 then 



Note; 



Get one or Bore choiees fron the list froo the substring diet Batch 
the resulting list of matches is stored in word^e^cr? 



therefore, the PiekChoice routine oust ccnvert^eTl^the 0 "* 
codes to words in a temporary array for the PickList routine 



Label$ = "Select Search Term(s)"
GlobalStatus = AddWords

REDIM Picked(1 TO 15 - Expr.SubExpr(CurrentSub).Num)

CALL PickChoice(MatchCodes(), NumMatchCodes, Label$, Picked(), NumPicked, ExactFlag, Expr)
ELSE

NumPicked = 1

REDIM Picked(1 TO 1)

Picked(1) = 1 ' MatchCodes(1).Code
END IF

IF NumPicked = -10 THEN ' F10 was pressed: New Search
  IF Lang$ = "GERMAN" THEN
    i$ = Question$("NEUE SUCHE ? [J/N] ", "JN", "ACHTUNG !")
    IF i$ = "J" THEN
      GlobalStatus = NewSearch
      EXIT SUB
    END IF
  ELSE
    i$ = Question$("New Search? [Y/N] ", "YN", "WARNING!")
    IF i$ = "Y" THEN
      GlobalStatus = NewSearch
      EXIT SUB
    END IF
  END IF

ELSEIF NumPicked > 0 THEN ' add words to expression

' store choices into current SubExpr.Phrase

NumPhrase = Expr.SubExpr(CurrentSub).Num

Expr.SubExpr(CurrentSub).Num = NumPhrase + NumPicked

FOR i = 1 TO NumPicked

IF KYIndx(MatchCodes(Picked(i)).Code).Num THEN
NumExprWords = NumExprWords + 1

REDIM PRESERVE ExprCodes(1 TO NumExprWords) AS CodePolyType
ExprCodes(NumExprWords).Code = MatchCodes(Picked(i)).Code

SasGettEl E)tprCodesCNuJiExprtfords).Poly, LEN(ExprCodes(1).Poly>, ExprCodes (RuaExprwords) .Code, PolySeoyENS 

ELSE 

IF LangS « "GERMAN" THEN 

DispHsg "Der Begriff »• + TeroS ♦ erscheint in ke^nea Ooxueent. LEERTASTE ua enderen Begriff einjug 

ELSE 

DispMsg "No documents that contain '" ♦ TeroS ♦ "' -ere found. Presa the Space Bar to enter a new Sea 

END IF 
WaitSpace
DispMsg "", 0, 0

GOTO GetSearchTerm ' get a new word to search for

END IF

NEXT

OriginalExpr$ = ""

SortT ExprCodes(1), NumExprWords, Descend, LEN(ExprCodes(1)), 2, -3
FOR k = 1 TO NumExprWords
  Code2Str Expr.SubExpr(CurrentSub).Phrase, k, ExprCodes(k).Code
  OriginalExpr$ = OriginalExpr$ + MKI$(ExprCodes(k).Code)
NEXT

Expr.Match = -1 ' reset search flag in full expression
' when this flag is -1, the program
' knows that no valid search has been done

ELSE ' none were picked

ScrSR "R"

GOTO GetSearchTerm

END IF

ERASE MatchCodes, Picked

Expr.Num = CurrentSub ' set number of expressions to the current sub
END SUB

SUB AddSwaps (Expr AS ExpressionType, Exclude$, Collect() AS CollectType, NumCollect) STATIC
GlobalStatus = EditSearch
IF Expr.SubExpr(1).Num = 15 THEN ' can't add any more words
  Chime 4

  DispMsg "ERROR: Limit of 15 Search Terms. No more Search Terms can be added. Press the Space Bar to continue:"
  WaitSpace
  DispMsg 0, 0
  EXIT SUB

END IF

DIM FC63 AS FreqComp63
FC63LEN = LEN(FC63)

'--- if not enough words to run SWAPS then return 12 relatives
IF Expr.SubExpr(1).Num < 2 THEN

  Code = Str2Code%(Expr.SubExpr(1).Phrase, 1)
  FGetRT Freq63FILE, FC63, CLNG(Code), FC63LEN
  i = 0: NumCollect = 0

  DO

    ExclFlag = FALSE
    i = i + 1

    ' do not include words from Exclude$
    FOR j = 1 TO LEN(Exclude$) / 2

      IF Str2Code(Exclude$, j) = FC63.Comp(i) THEN
        ExclFlag = TRUE
        EXIT FOR

      END IF

    NEXT

    IF NOT ExclFlag THEN

      NumCollect = NumCollect + 1

      REDIM PRESERVE Collect(1 TO NumCollect) AS CollectType
      Collect(NumCollect).Code = FC63.Comp(i)

    END IF

  LOOP UNTIL NumCollect = 20 OR NumCollect = FC63.Num
  EXIT SUB

END IF

•IdealFreqi ■ <FileSi«*C5o*Dir$ *• *.NDX"> \ 8) ♦ .007 
DIM SynthTemp AS BitValue
LenSynth = LEN(SynthTemp)
FreeSpace& = FRE(-1)

'If there is enough memory then use low-memory otherwise use EMS

MemReq& = CLNG(DictWordNum) * LenSynth

IF MemReq& + 1024 < FreeSpace& THEN ' leave 1K free

  REDIM Synth(1 TO DictWordNum) AS BitValue

  MemFlag = TRUE

ELSE

  NumPages = MemReq& \ SixteenK + 1
  EmsAlloc NumPages, SynthEMS, "Swaps EMS Storage"
  '--- clear (to 0) the EMS memory we allocated
  EMSPF = EmsGetPFSeg%
  FOR i = 1 TO NumPages

    CALL EmsMapMem(SynthEMS, 1, i) ' map logical page i to physical page 0
    CALL InitMem(EMSPF, 0, 8192, 0) ' clear physical page of memory to 0

  NEXT

  MemFlag = FALSE

END IF

'--- SWAPS ROUTINE

IF Lang$ = "GERMAN" THEN

  DispMsg "System analysiert Eingabe auf verwandte Themen", R, c

ELSE

  DispMsg "System is analysing query for related topics", R, c

END IF

'CPrint CHR$(CR)

'CPrint " Code Freq Word/Phrase # Relatives" + CHR$(CR)
DIM Freq63 AS FreqComp63
FCLEN = LEN(Freq63)
NumSynth = 0

FOR j = 1 TO Expr.SubExpr(1).Num

  Code = Str2Code(Expr.SubExpr(1).Phrase, j)
  FGetRT Freq63FILE, Freq63, CLNG(Code), FC63LEN
  FOR k = 1 TO Freq63.Num

    ' sum all percentage values of words in FC list (really Relative list)
    ' which contain words from the expression

    '**** boost SWAPS 1/21/92 VN

    IF KeyInstr%(OriginalExpr$, MKI$(Code)) = 0 THEN

      AscFCInfoValue = ASC(Freq63.Value(k)) * Boost!

    ELSE

      AscFCInfoValue = ASC(Freq63.Value(k))

    END IF

    IF MemFlag THEN

      Synth(Freq63.Comp(k)).Value = Synth(Freq63.Comp(k)).Value + AscFCInfoValue
      '--- set bit to indicate that this word now has a value
      SetBit Synth(Freq63.Comp(k)).Bit, j, 1
      '--- save the code number
      Synth(Freq63.Comp(k)).Code = Freq63.Comp(k)

    ELSE

      EmsGet1El SynthTemp, LenSynth, Freq63.Comp(k), SynthEMS
      SynthTemp.Value = SynthTemp.Value + AscFCInfoValue

      ' set bit to indicate that this word now has a value
      SetBit SynthTemp.Bit, j, 1
      '--- save the code number
      SynthTemp.Code = Freq63.Comp(k)

      EmsSet1El SynthTemp, LenSynth, Freq63.Comp(k), SynthEMS

    END IF

  NEXT

NEXT

' Generate final values for words

'Debug = TRUE

QueryLen = UBOUND(ExprCodes)
REDIM Poly!(1 TO QueryLen)
FOR i = 1 TO QueryLen

  Poly!(i) = (ExprCodes(i).Poly / PolyAvg!) * .25

NEXT

FOR i = 1 TO DictWordNum
  'get value
  IF MemFlag THEN

    LSET SynthTemp = Synth(i)

  ELSE

    EmsGet1El SynthTemp, LenSynth, i, SynthEMS

  END IF

  'make final calculation

  IF SynthTemp.Value THEN

    EmsGet PolyValue, PolyLEN, CLNG(SynthTemp.Code), PolySemyEMS
    Temp& = SynthTemp.Value * CLNG(PolyValue.Value)
    Temp& = ComboSum&(SynthTemp.Bit, Temp&, "S", Poly!())
    'vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv

    'boost words with ideal frequency (NumDoc*0.007)

    FreqRatio! = KYIndx(SynthTemp.Code).Num / IdealFreq!
    IF FreqRatio! > 1 THEN FreqRatio! = 1 / FreqRatio!

    IF Temp& > 32767 THEN Temp& = 32767
    SynthTemp.Value = Temp&
    'set value back
    IF MemFlag THEN

      LSET Synth(i) = SynthTemp

    ELSE

      EmsSet1El SynthTemp, LenSynth, i, SynthEMS

    END IF

  END IF

NEXT

'CPrint CHR$(CR)

'--- exclude all words in the Exclude$ by setting their values to 0
FOR i = 1 TO LEN(Exclude$) \ 2
  IF MemFlag THEN

    Synth(Str2Code(Exclude$, i)).Value = 0

  ELSE

    EmsGet1El SynthTemp, LenSynth, Str2Code(Exclude$, i), SynthEMS
    SynthTemp.Value = 0

    EmsSet1El SynthTemp, LenSynth, Str2Code(Exclude$, i), SynthEMS

  END IF

NEXT

' sort the Synth() by Value, in decreasing order

IF MemFlag THEN

  SortT Synth(1), DictWordNum, Descend, LEN(Synth(1)), 0, -1 ' -1 = integer sort

ELSE

  DIM EL1 AS BitValue: DIM EL2 AS BitValue

  EmsSort SEG EL1, SEG EL2, LenSynth, CLNG(DictWordNum), 0, 0, 0, SynthEMS
  ' SortSwapEMS SynthEMS, DictWordNum
END IF

' add as many synthetic relatives as needed to fill out the box to 8
' but add at least 6 of them no matter how many relatives there are.
' this means there can be at most 11 relatives returned
' (5 regular and 6 synthetic)

SynthAdd = 20

' add the synth relatives to the Collect list

FOR i = 1 TO DictWordNum 'SynthAdd
  IF MemFlag THEN

    LSET SynthTemp = Synth(i)

  ELSE

    EmsGet1El SynthTemp, LenSynth, i, SynthEMS

  END IF

  IF SynthTemp.Value > 0 THEN ' add it

    IF KYIndx(SynthTemp.Code).Num > Limit AND NumCollect < SynthAdd THEN
      NumCollect = NumCollect + 1

      REDIM PRESERVE Collect(1 TO NumCollect) AS CollectType
      Collect(NumCollect).Code = SynthTemp.Code
      IF NumCollect = SynthAdd THEN EXIT FOR

    END IF

  ELSE ' exit out, because the rest have 0 values
    EXIT FOR

  END IF

NEXT

DispMsg 0, 0

IF MemFlag THEN

  ERASE Synth

ELSE

  EmsRelMem SynthEMS

END IF
END SUB 
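
The accumulation performed by AddSwaps can be summarised outside BASIC. The following Python sketch is an illustration under stated assumptions, not the listing's code: the names `swaps`, `relatives`, `doc_freq` and `limit` are hypothetical, a dictionary stands in for the Synth() array (or its EMS copy), and the Boost! factor and the ComboSum& adjustment are deliberately left out.

```python
from collections import defaultdict


def swaps(query_codes, relatives, doc_freq, exclude, limit, max_add=20):
    """Sum relative weights over all query words and return the best codes.

    relatives maps a word code to a list of (relative_code, weight) pairs,
    playing the role of one Freq63 record per query word (assumed layout).
    """
    value = defaultdict(int)  # summed weight per candidate code
    bits = defaultdict(int)   # bit j is set when query word j contributed
    for j, code in enumerate(query_codes, start=1):
        for rel_code, weight in relatives.get(code, []):
            value[rel_code] += weight
            bits[rel_code] |= 1 << (j - 1)
    # Words on the exclude list are zeroed, as in the listing.
    for code in exclude:
        value[code] = 0
    # Sort by summed value, descending, and stop at the first zero value.
    ranked = sorted(value.items(), key=lambda kv: kv[1], reverse=True)
    picked = []
    for code, val in ranked:
        if val <= 0:
            break
        if doc_freq.get(code, 0) > limit and len(picked) < max_add:
            picked.append(code)
    return picked, dict(bits)
```

The bit mask per candidate records which query words proposed it; in the listing that mask is what ComboSum& later uses to weigh pairs of query words.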

FUNCTION BoxInput$ (Edit$, Title$, Prompt$, Row%, Col%, Scan%) STATIC

' Displays a Box for input at upper-left location (Row%, Col%),
' with the Title (if any) centered at the top of the Box,
' and the Prompt displayed before the text input area.

' CALLS: Editor, Monitor%, Box0, ClearScr0, QPrintRC

EditLen = LEN(Edit$)

ULRow = Row%: ULCol = Col%

LRRow = ULRow + 4: LRCol = ULCol + EditLen + 5

IF LEN(Prompt$) THEN LRCol = LRCol + LEN(Prompt$) + 1

CALL ClearScr0(ULRow, ULCol, LRRow, LRCol, HiAttr)
CALL DrawBox(ULRow, ULCol, LRRow, LRCol, 2, HiAttr)
IF Title$ <> "" THEN

  CALL QPrintRC(LMarg$ + Title$ + RMarg$, ULRow, (ULCol + LRCol - LEN(Title$) - 2) \ 2, HiAttr)

END IF

IF LEN(Prompt$) THEN CALL QPrintRC(Prompt$, ULRow + 2, ULCol + 2, HiAttr)
CALL DrawBox(ULRow + 1, LRCol - EditLen - 3, LRRow - 1, LRCol - 2, 1, HiAttr)

LOCATE LRRow - 2, LRCol - EditLen - 2

' Enlarge the cursor so it's easier to spot
SELECT CASE Monitor%

  CASE IS < 3 ' Mono/Herc

    LOCATE , , 1, 10, 13
  CASE IS = 3 ' CGA

    LOCATE , , 1, 7, 8
  CASE ELSE ' VGA/EGA/MCGA
    LOCATE , , 1, 11, 14

END SELECT

CALL Editor(Edit$, EditLen, Scan, 0, 0, HiAttr, NormAttr, LRRow - 2, LRCol - EditLen - 2)
LOCATE , , 0

BoxInput$ = RTRIM$(LEFT$(Edit$, EditLen))
END FUNCTION

SUB BuildCombTable (Mode$) STATIC

REDIM CombTable(12 TO 20) AS SINGLE

DIM DocFreqA AS LONG, DocFreqB AS LONG

IF Mode$ = "S" THEN

  Divisor! = 42!

ELSE

  Divisor! = 23!

END IF

NumKeys = LEN(ExprKeys$) \ 2

REDIM Freq63(1 TO NumKeys) AS FreqComp63

FCLEN = LEN(Freq63(1))

'--- only load the relative lists for the words in the expression

FOR i = 1 TO NumKeys

  Code = Str2Code%(ExprKeys$, i)

  FGetRT Freq63FILE, Freq63(i), CLNG(Code), FCLEN
NEXT
R = 0

FOR i = 1 TO NumKeys - 1

  FOR j = i + 1 TO NumKeys

    DocFreqA = KYIndx(Freq63(i).Code).Num
    DocFreqB = KYIndx(Freq63(j).Code).Num
    RootOfDocFreq! = SQR((DocFreqA + DocFreqB) / 2)
    Subscript = i * 13 + j
    FOR k = 1 TO Freq63(j).Num

      ' check if A was found as a relative in B's list
      IF Freq63(i).Code = Freq63(j).Comp(k) THEN

        '--- the relative value for the AB pair is
        CombTable(Subscript) = ASC(Freq63(j).Value(k))
        EXIT FOR

      END IF

    NEXT

    IF CombTable(Subscript) = 0 THEN
      FOR k = 1 TO Freq63(i).Num

        ' check if B was found as a relative in A's list
        IF Freq63(j).Code = Freq63(i).Comp(k) THEN

          CombTable(Subscript) = ASC(Freq63(i).Value(k))
          EXIT FOR

        END IF

      NEXT

    END IF

    CombTable(Subscript) = CombTable(Subscript) * (RootOfDocFreq! / RootOfAvgDocFreq!) / Divisor!

  NEXT

NEXT

ERASE Freq63
END SUB
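
The pair table built above can be pictured with a short Python sketch. It is a hedged illustration only: `build_comb_table`, `relatives`, `doc_freq` and `root_avg_doc_freq` are hypothetical names, a dictionary replaces the CombTable array, and each relative list is assumed to hold a given code at most once.

```python
import math


def build_comb_table(codes, relatives, doc_freq, root_avg_doc_freq, mode="S"):
    """Return {i * 13 + j: scaled pair weight} for every pair i < j (1-based)."""
    divisor = 42.0 if mode == "S" else 23.0
    table = {}
    for i in range(1, len(codes)):
        for j in range(i + 1, len(codes) + 1):
            a, b = codes[i - 1], codes[j - 1]
            # Look for A in B's relative list first, then B in A's list.
            weight = dict(relatives.get(b, [])).get(a, 0)
            if weight == 0:
                weight = dict(relatives.get(a, [])).get(b, 0)
            # Scale by the root of the mean document frequency of the pair.
            root = math.sqrt((doc_freq[a] + doc_freq[b]) / 2)
            table[i * 13 + j] = weight * (root / root_avg_doc_freq) / divisor
    return table
```

The subscript `i * 13 + j` mirrors the listing: it gives every ordered pair of query positions its own slot, so the later lookup needs no search.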

FUNCTION ComboSum& (Bits, Value&, Mode$, Poly!()) STATIC

REDIM Words(1 TO 13)
Multiplier! = 0

' find words shown by the on bits

NumbWords = 0
PolyWord! = 0

QueryLen = LEN(ExprKeys$) \ 2
FOR BitNumb = 1 TO QueryLen

  IF GetBit%(Bits, BitNumb) THEN

    NumbWords = NumbWords + 1

    PolyWord! = PolyWord! + Poly!(BitNumb)

    Words(NumbWords) = BitNumb

  END IF

NEXT

'**** QueryLen = NumbWords TEST: THY 4/8/91



' find all combinations of 2 words

CombSum! = 0

FOR i = 1 TO NumbWords - 1

  FOR j = i + 1 TO NumbWords

    TableNumb = (Words(i) * 13) + Words(j) ' which combination of 2

    '--- get value from table

    PairValue! = CombTable(TableNumb)

    '** 3/20/91: no more sqr root here, see BuildCombTable for changes
    ' compute sqr root of value and divide by Divisor
    ' with appropriate max value
    'PairValue! = SQR(PairValue!) / Divisor

    IF Mode$ = "S" THEN

      SELECT CASE QueryLen

        CASE 2: MaxPair! = .3

        CASE 3: MaxPair! = 1!

        CASE IS >= 4: MaxPair! = .9

        CASE ELSE
      END SELECT

    ELSE

      SELECT CASE QueryLen

        CASE 2: MaxPair! = .5

        CASE 3: MaxPair! = 1.3

        CASE 4: MaxPair! = 1.2

        CASE IS >= 5: MaxPair! = 1.1

        CASE ELSE
      END SELECT

    END IF

    IF PairValue! > MaxPair! THEN PairValue! = MaxPair!
    CombSum! = CombSum! + PairValue!

  NEXT

NEXT

'--- Summed maximum values
IF Mode$ = "S" THEN

  SELECT CASE QueryLen

    CASE 2: MaxComb! = .3

    CASE 3: MaxComb! = 1.4

    CASE 4: MaxComb! = 1.8

    CASE 5: MaxComb! = 2.3

    CASE IS >= 6: MaxComb! = 2.8

    CASE ELSE: MaxComb! = 10!
  END SELECT

ELSE

  SELECT CASE QueryLen

    CASE 2: MaxComb! = .5

    CASE 3: MaxComb! = 1.6

    CASE 4: MaxComb! = 1.9

    CASE 5: MaxComb! = 2.3

    CASE IS >= 6: MaxComb! = 2.8

    CASE ELSE: MaxComb! = 10!
  END SELECT

END IF

IF CombSum! > MaxComb! THEN
  CombSum! = MaxComb!

END IF

' modify value using formula

IF NumbWords > 1 THEN

  ' new value = old value * formula calculated multiplier

  ' Value = sum of weights * (# of words - sum of calculated values for all pairs)

  Power! = PolyWord! - CombSum!

  IF Power! > 1 THEN

    IF Mode$ = "S" THEN

      Multiplier! = 2 ^ Power!

    ELSE

      Multiplier! = 1.8 ^ Power!

    END IF

  END IF

  cs# = Value& * Multiplier!

  IF cs# > 2147483647 THEN cs# = 2147483647

  ComboSum& = cs#

ELSE

  cs& = Value&
  ComboSum& = cs&

END IF

END FUNCTION 
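
As a cross-check of the scoring rule in ComboSum&, here is a hedged Python sketch. It assumes that bit b of the mask is stored at position b - 1, that the garbled operator in the multiplier lines is exponentiation (suggested by the variable name Power!), and that the query has at least two words; `combo_sum` and its parameter names are illustrative only.

```python
PAIR_CAPS = {"S": {2: 0.3, 3: 1.0}, "R": {2: 0.5, 3: 1.3, 4: 1.2}}
PAIR_CAP_REST = {"S": 0.9, "R": 1.1}
SUM_CAPS = {"S": {2: 0.3, 3: 1.4, 4: 1.8, 5: 2.3},
            "R": {2: 0.5, 3: 1.6, 4: 1.9, 5: 2.3}}
SUM_CAP_REST = 2.8  # used for six or more query words


def combo_sum(bits, value, mode, poly, comb_table, query_len):
    """Scale value by base ** (sum of word weights - capped pair overlap)."""
    key = "S" if mode == "S" else "R"
    # Query positions (1-based) whose bit is set in the mask.
    words = [b for b in range(1, query_len + 1) if (bits >> (b - 1)) & 1]
    if len(words) <= 1:
        return value
    poly_word = sum(poly[b - 1] for b in words)
    max_pair = PAIR_CAPS[key].get(query_len, PAIR_CAP_REST[key])
    comb = 0.0
    for x in range(len(words) - 1):
        for y in range(x + 1, len(words)):
            pair = comb_table.get(words[x] * 13 + words[y], 0.0)
            comb += min(pair, max_pair)  # cap each pair
    comb = min(comb, SUM_CAPS[key].get(query_len, SUM_CAP_REST))  # cap the sum
    power = poly_word - comb
    base = 2.0 if key == "S" else 1.8
    # As in the listing, the multiplier stays 0 unless power exceeds 1.
    multiplier = base ** power if power > 1 else 0.0
    return min(round(value * multiplier), 2147483647)
```

Strongly related query words (large pair values) reduce the exponent, so a document or candidate that matches two near-synonyms gains less than one that matches two unrelated query words.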

SUB DeleteWord (Expr AS ExpressionType) STATIC
GlobalStatus = EditSearch
IF Expr.SubExpr(1).Num = 0 THEN
  Chime 4

  DispMsg "ERROR: No Words to Delete. Press the Space Bar to continue: ", R, c
  WaitSpace
  DispMsg 0, 0
  EXIT SUB

END IF

GOSUB DispPrompt

IF Lang$ <> "GERMAN" THEN

  CALL References("")

END IF
DO

  DO

    Kee$ = INKEY$
  LOOP UNTIL LEN(Kee$) > 0
  Kee = ASC(UCASE$(RIGHT$((Kee$), 1)))
  IF Kee = F10 THEN

    IF Lang$ = "GERMAN" THEN

      i$ = Question$("NEUE SUCHE ? [J/N] ", "JN", "ACHTUNG !")
      IF i$ = "J" THEN

        GlobalStatus = NewSearch

        EXIT SUB

      END IF

    ELSE

      i$ = Question$("New Search? [Y/N] ", "YN", "WARNING!")
      IF i$ = "Y" THEN

        GlobalStatus = NewSearch

        EXIT SUB

      END IF

    END IF
  ELSEIF Kee = ESC THEN

    EXIT DO

  ELSEIF (Kee >= ASC0 + 1 AND Kee <= ASC9) OR (Kee >= AscUpperA AND Kee <= AscUpperA + Expr.SubExpr(1).Num - 9) THEN
    n = VAL(Kee$)

    IF n = 0 THEN n = Kee - 55 ' so A=10, B=11, etc.

    IF n > Expr.SubExpr(1).Num THEN

      '--- user chose a word that doesn't exist
      BEEP

    ELSE

      '--- get code of chosen word

      Code = Str2Code(Expr.SubExpr(1).Phrase, n)

      '--- save the deleted word in a last-in-first-out stack
      ' for use in undeleting

      LIFO$ = LIFO$ + MKI$(Code)

      '--- now delete the word by moving the words ahead of it
      ' up one slot
      ExprCodes(n).Poly = 0
      FOR i = n TO Expr.SubExpr(1).Num - 1

        Code2Str Expr.SubExpr(1).Phrase, i, Str2Code(Expr.SubExpr(1).Phrase, i + 1)

        Code2Str Expr.SubExpr(1).Phrase, i + 1, 0

        LSET ExprCodes(i) = ExprCodes(i + 1)

      NEXT

      '--- decrement word counter

      Expr.SubExpr(1).Num = Expr.SubExpr(1).Num - 1

      NumExprWords = NumExprWords - 1

      IF Expr.SubExpr(1).Num = 0 THEN Expr.Num = 0

      Expr.Match = -1 ' Reset to No full search yet

      Expr.SubExpr(1).Match = -1 ' Reset to No search yet

      EXIT DO

    END IF

  ELSE

    '--- invalid keypress

    BEEP

  END IF

LOOP

DispMsg 0, 0
EXIT SUB



DispPrompt:

FOR i = 1 TO MinInt(9, Expr.SubExpr(1).Num)
  QPrintRC " ", 3 + i, 4, NormAttr

  QPrintRC STR$(i), 3 + i, 6, RevAttr

NEXT

FOR i = 10 TO Expr.SubExpr(1).Num

  QPrintRC " ", 3 + i, 4, NormAttr

  QPrintRC " " + CHR$(AscUpperA - 10 + i), 3 + i, 6, RevAttr

NEXT

IF Lang$ = "GERMAN" THEN

  DispMsg "Zahl oder Buchstabe um Wort zu Loeschen ", R, c

ELSE

  DispMsg "Enter a Number or Letter to Delete Word/Phrase or Press ESC to Cancel", R, c

END IF
RETURN
END SUB
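
The key handling and deletion in DeleteWord reduce to two small steps, sketched here in Python for illustration. The function names are hypothetical, and a Python list stands in for both the packed Phrase string and the LIFO$ undelete stack.

```python
def key_to_index(key):
    """Map '1'..'9' to 1..9 and 'A', 'B', ... to 10, 11, ... (0 if invalid)."""
    key = key.upper()
    if len(key) != 1:
        return 0
    if "1" <= key <= "9":
        return int(key)
    if "A" <= key <= "Z":
        return ord(key) - 55  # ord("A") is 65, so A maps to 10
    return 0


def delete_word(codes, n, undo_stack):
    """Remove the n-th code (1-based), remembering it for a later undelete."""
    if not 1 <= n <= len(codes):
        raise IndexError("no such word")
    undo_stack.append(codes[n - 1])
    del codes[n - 1]  # same effect as shifting the later words down one slot
```

Pushing the removed code before deleting it is what makes an undelete possible: the most recently deleted word is always on top of the stack.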

SUB DrawBox (ULRow, ULCol, LRRow, LRCol, Frame, Col) STATIC
IF TermTypeFlag THEN

  CALL Box0(ULRow, ULCol, LRRow, LRCol, Frame, Col)

ELSE

  QPrintRC "+" + STRING$(LRCol - ULCol - 1, "-") + "+", ULRow, ULCol, Col

  FOR i = ULRow + 1 TO LRRow - 1

    QPrintRC "|", i, ULCol, Col

    QPrintRC "|", i, LRCol, Col

  NEXT

  QPrintRC "+" + STRING$(LRCol - ULCol - 1, "-") + "+", LRRow, ULCol, Col

END IF
END SUB

SUB FindRelatives (Expr AS ExpressionType, SubNum, Ptr, Rel() AS CollectType, NumRel, Exclude$) STATIC
SubNum = 1

DIM RelTemp AS RelativeType ' temporary storage for retrieval from EMS
REDIM Collect(1 TO 1) AS CollectType
NumCollect = 0: Added = 0: NumRel = 0

ExprKeys$ = ""

NumExprKeys = 0

FOR i = 1 TO Expr.Num

  ExprKeys$ = ExprKeys$ + LEFT$(Expr.SubExpr(i).Phrase, Expr.SubExpr(i).Num * 2)

  NumExprKeys = NumExprKeys + Expr.SubExpr(i).Num

NEXT

' exclude list is all words from query plus parts of combkw plus parts of
' words with prefixes.
Exclude$ = ExprKeys$ + ExcludeAdd$



SUB FullText (First&, Last&, FileNum, Expr$)

' FullText.bas - Nov 5, 1990 CompLex

' . CREATE ARRAY1 (POINTERS) FROM DICT.COD & SORT BY CODE

' . FIND ALL SYN. MATCHING CODES & PUT INTO ARRAY3

' . READ TEXT LINES FROM FILE INTO ARRAY2

' . CHECK FOR SYN. MATCHES FOR ALL WORDS IN LINES

' . DISPLAY TEXT LINES

' . HILITE WORDS MATCHING SYNS. IN ARRAY3

LinCount = Last& - First& + 1

IF LinCount > 1000 THEN LinCount = 1000 ' maximum fitting in 128K
DIM DictTemp AS DictType ' var. to read ems rec. into

DictLEN = LEN(DictTemp)

DIM Array3$(1 TO 1) ' contains all syns.

REDIM Array1(1 TO DictCodeNum) AS Array1Type

Ems2Array Array1(1), LEN(Array1(1)), DictCodeNum, Array1EMS

' [ find all synon. for all search words & place into array3 ]

FOR Counter = 1 TO (LEN(Expr$) \ 2)

  ThisCode = Str2Code%(Expr$, Counter)

  ' binary search array1...
  L = 1

  R = DictCodeNum
  DO

    ThisElement = (CLNG(L) + R) \ 2
    IF ThisCode < Array1(ThisElement).CodeNum THEN
      R = ThisElement - 1

    ELSE

      L = ThisElement + 1
    END IF
  LOOP UNTIL ThisCode = Array1(ThisElement).CodeNum OR L > R

  ' if found...

  IF ThisCode = Array1(ThisElement).CodeNum THEN

    '--- back up until first one

    DO UNTIL Array1(ThisElement - 1).CodeNum <> ThisCode
      ThisElement = ThisElement - 1

    LOOP

    '--- copy all syns. into array3$
    DO UNTIL ThisCode <> Array1(ThisElement).CodeNum

      EmsGet1El DictTemp, LEN(DictTemp), Array1(ThisElement).RecNum, DictCodeH

      NumOfSyns = NumOfSyns + 1

      REDIM PRESERVE Array3$(1 TO NumOfSyns)

      Array3$(NumOfSyns) = RTRIM$(DictTemp.Str)

      ThisElement = ThisElement + 1

      IF ThisElement > DictCodeNum THEN EXIT DO

    LOOP

  END IF

NEXT

ERASE Array1 ' don't need it anymore, and it's still in EMS
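
The lookup above (binary search on a code-sorted index, back up to the first duplicate, then copy every entry with the same code) can be sketched in Python. This is an illustration under assumptions, not the listing's code: `synonyms_for`, `index` and `records` are hypothetical names, and `bisect_left` replaces the hand-written search and back-up steps because it lands directly on the first matching entry.

```python
from bisect import bisect_left


def synonyms_for(code, index, records):
    """Collect the text of every dictionary record stored under code.

    index is a list of (code, rec_num) pairs sorted by code;
    records maps rec_num to the stored, space-padded text.
    """
    keys = [c for c, _ in index]
    pos = bisect_left(keys, code)  # first entry whose code is >= code
    found = []
    while pos < len(index) and index[pos][0] == code:
        found.append(records[index[pos][1]].rstrip())
        pos += 1
    return found
```

Checking `pos < len(index)` before each comparison avoids running off the end of the index when the last entries all share the searched code.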

REDIM Text(1 TO LinCount) AS STRING * 80 ' contains text lines
REDM Irr»y4(1 TO Llntoun^. 1 TO 30) M-Mtt WT* »«f. 

IF NumOfSyns = 0 THEN
  Chime 10

  IF Lang$ = "GERMAN" THEN
    QPrintRC "Number of syns = 0", 1, 1, -1
  ELSE

    QPrintRC "Number of syns = 0", 1, 1, -1
  END IF
  STOP
END IF

REDIM Array5$(1 TO NumOfSyns, 5) ' contains all syns (parsed)

REDIM Array6(1 TO NumOfSyns) ' parallel to Array5: # of words in row

FOR Counter = 1 TO NumOfSyns

  Array5$(Counter, 0) = STR$(InCount(RTRIM$(Array3$(Counter)), " ") + 1)
  Array6(Counter) = InCount(RTRIM$(Array3$(Counter)), " ") + 1

  FOR Counter1 = 1 TO VAL(Array5$(Counter, 0))

    CALL Extract(Array3$(Counter), " ", Counter1, StrtSyn, SlenSyn)
    Array5$(Counter, Counter1) = MID$(Array3$(Counter), StrtSyn, SlenSyn)

  NEXT 'counter1

NEXT 'counter

' [ read text lines from file into array2 ]

ThisLine = 0

FOR Counter& = First& TO Last&
  ThisLine = ThisLine + 1
  IF ThisLine > LinCount THEN EXIT FOR
  GET FileNum, Counter&, Text(ThisLine)

  Array4(ThisLine, 1) = -1 ' indicates this line was not checked

NEXT



-C display text Lines & poll for avai I. keys J- 



LinePtr = 1 ' Set line pointer

PrevLinePtr = 0

NotNewSearch:

DO

  IF LinePtr <> PrevLinePtr THEN
    PrevLinePtr = LinePtr



• Update the 2* Unas of text 

IF AscInkee <> UP AND AscInkee <> Dn THEN

  ' Print information bar at bottom
  IF TermTypeMode$ = "LOCAL" THEN
    LeftChar$ = CHR$(26)
    RightChar$ = CHR$(27)

    ExpChar$ = "F2"

    DirDocChar$ = "F5"
    UpChar$ = CHR$(24)
    DnChar$ = CHR$(25)
    PgUpChar$ = "PgUp, "
    PgDnChar$ = "PgDn, "
    HomeChar$ = "HOME"
    EndChar$ = "END"

  ELSE

    LeftChar$ = CHR$(LeftArrowKey)
    RightChar$ = CHR$(RightArrowKey)
    ExpChar$ = CHR$(ShowExprKey)
    DirDocChar$ = CHR$(DirNumKey)
    UpChar$ = CHR$(UpArrowKey)
    DnChar$ = CHR$(DownArrowKey)
    PgUpChar$ = CHR$(PgUpKey) + ": PgUp, "
    PgDnChar$ = CHR$(PgDnKey) + ": PgDn, "
    HomeChar$ = CHR$(HomeKey)
    EndChar$ = CHR$(EndKey)

  END IF

" ^L^teK ^hars ♦ Sucneingabe" ♦ LeftCharS ♦ •: NaestStes Ook - ♦ Ri*tCherS ♦ Vorhergeh Dok - ♦ 01 
ELS£ InfolineS - UpCharS + •: Up,. " ♦ OnCharS ♦ ": Down. - ♦ P*FChar$ ♦ PgOnCharS ♦ HoeeCharS ♦ Top. - ♦ EndCha 
END IF . 
CALL References(InfolineS) 

END IF 

IF TermTypeMode$ = "LOCAL" THEN
  R = 23

ELSE

  R = 22

END IF

FOR i " 3 TO H 

ThisLine « i * LinePtr 

IF Th-iaLirte <■ LinCount THEM 

* print, line with noraa I attrib. 

QPrintRC Text (ThisLine), i ♦ 1, 1, NoraAttr 

' if this line not searched yet 
IF Array*(ThisLine, 1) < 0 THEN 

IF Arroy4(Thi*Line, 1) • -1 THEM 

Arrey*(Thi aline, 1) « 0 • indicates this line was checked 

ThisVordNua » 1 
ELSE ' if less then -1 

ThisUordNua = ArrayAOhisLine, 1) A -2 • 0 of words already checked 

Array4(ThisLine, 1) « 1 

END IF 

Niuuords = InCountX(RTRliiS(Text(ThisLine» t • "> ♦ 1 

LineTapS * LCASES(Text(ThisLine» 

GO sua FilterUne 

Count er2 = ThisuordNue - 1 
DO 

Counter^ .= counter? ♦ 1 

CALL Extract (Li neTepS, " % Counter*. Strt, Slen) 
•' CurrUordS ° HlDSdineTapS, Strt, Slen) 

' not a valid word so goto next word ft toe. word count 
IF LBKQirrWorOS) = 0 THEN « 
IF Ruawords « 80 THEN 

NuaUords = NuaUords * 1 

END IF 

GOTO LoopCounter2 

END IF 

FOR Counter3 = 1 TO NumOfSyns

  ' NumSynWords = VAL(Array5$(Counter3, 0)) ' test new array *** rem for spe
  CurrSynWord$ = Array5$(Counter3, 1) ' test new array ***
  SlenSyn = LEN(CurrSynWord$)

  IF Array6(Counter3) = 1 THEN ' if single syn. word
    IF RIGHT$(CurrSynWord$, 2) = "//" THEN
      SlenSyn = SlenSyn - 2

      CurrSynWord$ = LEFT$(CurrSynWord$, SlenSyn)

      Match = (CurrSynWord$ = CurrWord$)
    ELSEIF RIGHT$(CurrSynWord$, 1) = "/" THEN
      SlenSyn = SlenSyn - 1

      CurrSynWord$ = LEFT$(CurrSynWord$, SlenSyn)
      Match = (CurrWord$ = CurrSynWord$)

    ELSE

      Match = (CurrSynWord$ = LEFT$(CurrWord$, SlenSyn))

    END IF

    IF Match THEN ' add to array4

      Array4(ThisLine, 1) = Array4(ThisLine, 1) + 1
      Array4(ThisLine, (Array4(ThisLine, 1) * 2)) = Strt
      Array4(ThisLine, (Array4(ThisLine, 1) * 2) + 1) = Slen

    END IF
ELSE * if coabo syn. 

FirstStrt * Strt ' indicates where to start hi-Ute word group 

_ . • FOR Counter* = 1 TO Array6CCounter3) 

— IF Counter4.-» rTHEN * — 

* get next word in line 

This'^srdNus » Counter2 ♦ Counter 4 - 1 

ThiiLineTap = ThisLine 

IF ThirJordNua > NuaUords THEN 

ThisUordNua.a ThisVOrdNua - NuaUords • 
ThisLineTap a TbisLina ♦ 1 
IF ThisLine >= LinCount THEN 
Hatch 9 FALSE 
EXIT FOR 'count art 

END IF 

LineTapS = LCASES(TexttTMsLineTep)) 
GOSUB FilterUne 

END IF 

. CALL UtractCLineTapS, " ', ThisUordNua. Strt, Slen) 
CurrvcrdS - LCASESOUDSCLineTopS, Strt. SlerO) 
' get next syn. 

CurrSynuordS - ArraySM Count er3. Counter*) • test ne 
SlenSyn - LEH( CurrSynuordS) 

END IF 

IF RIOfTS(CurrSyn*ordS, 2) « ■//" THEN 
SLer.Syn » SlenSyn - 2 
. CurrSynuordS » LEFTS ( Cur rSynUordS. SlenSyn) 

Batch « (CurrUordS = CurrSynuordS) 
ELSEIF RIOfTi:CurrSynUordS, 1) =» THEN 
SlenS/n « SlenSyn - 1 

CurrSynuordS a LEm( CurrSynuordS, SlenSyn) 






Rater. « (CurrttordS « CurrSynttordS) 

ELSE 

natch « (CurrSynltordS = LEFTS (CurrttordS, StenSyn)) 

END IP 

IP CurrSynUcrdS = "3" THEN 

IF isSTRUtList*. ■/" ♦ CurrttordS ♦ V"> THEN 
Hatch s TRUE 

ELSE 



natch = FJU.SE 

"END I? 

END IF 

IF Hatch s FILSc THEN EXIT FOR 
NEXT 'eountorA 

IF Hatch THEN ' aad to array* 

IF ThisLineTao > ThisLlne THEN ' ctato on 2 lines 
• 1*x Une 

SleriLl a 80 - PirstStrt 
StrtM » FirstStrt 

Arra»A<ThisLine, 1) • Array4(TMsLine, 1> ♦ 1 
Arra"y*<ThisLine, <Array4(ThisLine, 1) * 2)) ■ StrtLl 
Arrav4<ThisLine, <Array4(ThisLine, 1) * 2) + 1) • Slen 

' 2nd line 

SlenU - CStrt ♦ Slen) - 1 
Strt;2 * 1 

Arra»4<7MsLin«Tap, 1) - TnisUerdNusi * -2 ' I of words 
Arra^CThisLineTep. 2) a StrtU 
ArraUcThisLineTop, 3) = Slenl2 



ELSE 



ELSE 
EN 3 IF 



Slen = CStrt ♦ Slen) - FirstStrt 
Strt = ?1rstStft 

Arra»tCTh1sLine. 1) = ArrayiCThisLine. 1) ♦ 1 
ArraUtTMsLine. (Array« ThisLlne. 1) * 2)) * Strt 
ArraUcTMsLlne, (Array4<ThisLine, 1) * 2J ♦ 1) ■ Slen 



CALL Extract CLineTacS. * \ Count er2, Strt, Slen) 
CurrUordS = MDSaif«7*pS, Strt, Slen) 

EKD IF* was it a coob or sin; keyword? 

NEXT 'counters 

LOOP UNTIL Counter2 ■ NumUords 

END IF * 

• hi * lite words in Une 

IF ArrayMThisLine, 1) > 0 THEN 

FOR Countcr2 » 2 TO (Array4<Thi*Line. 1) * 2) STEP 2 

HLUordS - HlDS<Text<ThiaLine), Arra>A(ThisLine, Countar2); ArrayUThisLine, Counter2 ♦ 
QPrintRC HLUordS, 1*1, Array4(ThisJre, Counter2)„ RevAttr 
NEXT 'counter2 

END IF 

QPrintRC SPACE$(80>, I ♦ 1, 1, HornAttr 



NEXT i 

END IF 

' wait for a key to be pressed

DO

  Inkee$ = INKEY$
LOOP UNTIL LEN(Inkee$)
IF LEN(Inkee$) = 1 THEN

  AscInkee = ASC(UCASE$(Inkee$))
ELSEIF LEN(Inkee$) = 2 THEN

  AscInkee = ASC(RIGHT$(Inkee$, 1)) + 200

END IF

SELECT CASE AscInkee

  CASE UpArrowKey, UP

    IF LinePtr > 1 THEN

      LinePtr = LinePtr - 1

    END IF
  CASE DownArrowKey, Dn

    IF LinePtr < LinCount THEN

      LinePtr = LinePtr + 1

    END IF
  CASE PgUpKey, PGUP

    IF LinePtr > 1 THEN

      LinePtr = LinePtr - R
      IF LinePtr < 1 THEN
        LinePtr = 1

      END IF

    END IF
  CASE PgDnKey, PGDN

    IF LinePtr < LinCount - R - 1 THEN
      LinePtr = LinePtr + R
      IF LinePtr > LinCount THEN
        LinePtr = LinCount

      END IF
    END IF
CASS HoaeKey, W* 

    IF LinePtr > 1 THEN
      LinePtr = 1

    END IF
  CASE EndKey, EN

    IF LinePtr < LinCount - R THEN
      LinePtr = LinCount - R

    END IF

  CASE ESC, RightArrowKey, LeftArrowKey, DirNumKey, ShowExprKey, F2, F3, F4, F5
    ExitFlag = AscInkee
    QuitFlag = TRUE
  CASE NewSearchKey, F10

    IF Lang$ = "GERMAN" THEN

      i$ = Question$("NEUE SUCHE ? [J/N] ", "JN", "ACHTUNG !")
      IF i$ = "J" THEN

        ExitFlag = AscInkee
        QuitFlag = TRUE

      ELSE

        GOTO NotNewSearch

      END IF

    ELSE

      i$ = Question$("New Search? [Y/N] ", "YN", "WARNING!")
      IF i$ = "Y" THEN

        ExitFlag = AscInkee
        QuitFlag = TRUE

      ELSE

        GOTO NotNewSearch

      END IF

    END IF

  CASE ELSE
END SELECT
LOOP UNTIL QuitFlag

' [ free up memory ]

ERASE Text ' held the document text (fixed-len string)
ERASE Array3$
ERASE Array4
ERASE Array5$
ERASE Array6

EXIT SUB



FilterLine:

' replace all 's with spaces

PuncPos = 1
DO

  PuncPos = INSTR(PuncPos, LineTmp$, "'s")

  IF PuncPos > 0 THEN

    MID$(LineTmp$, PuncPos, 2) = "  "
    PuncPos = PuncPos + 2

  END IF

LOOP UNTIL PuncPos = 0

' replace all punctuations with spaces

FOR Fcounter = 33 TO 37

  CALL ReplaceChar(LineTmp$, CHR$(Fcounter),
NEXT 'Fcounter

FOR Fcounter = 39 TO 46

  CALL ReplaceChar(LineTmp$, CHR$(Fcounter),
NEXT 'Fcounter

FOR Fcounter = 58 TO 64

  CALL ReplaceChar(LineTmp$, CHR$(Fcounter),
NEXT 'Fcounter

RETURN
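
The FilterLine step can be restated compactly in Python. The sketch below is illustrative: `filter_line` is a hypothetical name, the lower-casing that the caller performs with LCASE$ is folded into the function, and a single space is assumed as the replacement character, following the comment "replace all punctuations with spaces" (the third argument of the ReplaceChar calls is cut off in the listing).

```python
def filter_line(line):
    """Lower-case a text line and blank out possessives and punctuation."""
    # Two spaces replace "'s" so that every column keeps its position.
    line = line.lower().replace("'s", "  ")
    # Character codes 33-37, 39-46 and 58-64, as in the three FOR loops.
    punct = [*range(33, 38), *range(39, 47), *range(58, 65)]
    table = {code: " " for code in punct}
    return line.translate(table)
```

Replacing characters one-for-one instead of deleting them keeps the filtered copy the same length as the original line, which matters because the start and length of each match are later used to highlight the word in the unfiltered text.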



SUB LoadPrefixes (Prefixes$(), MeanPrefixes$(), Lang$) STATIC

IF Lang$ = "GERMAN" THEN

  RESTORE GermanPrefixes

ELSE

  RESTORE EnglishPrefixes

END IF

REDIM Prefixes$(2 TO 9)

IF Lang$ = "GERMAN" THEN
  FOR i = 2 TO 9

    READ FirstHalf$, SecondHalf$
    Prefixes$(i) = FirstHalf$ + SecondHalf$

  NEXT

ELSE

  FOR i = 2 TO 9

    READ Prefixes$(i)

  NEXT

END IF

REDIM MeanPrefixes$(3 TO 14)

IF Lang$ = "GERMAN" THEN
  FOR i = 3 TO 14

    READ FirstHalf$, SecondHalf$, ThirdHalf$

    MeanPrefixes$(i) = FirstHalf$ + SecondHalf$ + ThirdHalf$

  NEXT

ELSE

  FOR i = 3 TO 14

    READ MeanPrefixes$(i)

  NEXT

END IF
END SUB

SUB OtherWords (Expr AS ExpressionType, WordPtr) STATIC

' Continue finding and selecting Relatives/SWAPs until none are selected

REDIM Relatives(1 TO 1) AS CollectType

FindRelatives Expr, CurrentSub, WordPtr, Relatives(), NumRelatives, Exclude$
ShowExpr Expr

IF Expr.SubExpr(1).Num > 1 THEN

  BuildCombTable ("S") ' indicate SWAPS is the caller

END IF

AddSwaps Expr, Exclude$, Relatives(), NumRelatives

NumSelected = SelectRelatives%(Expr, CurrentSub, Relatives(), NumRelatives)

IF GlobalStatus = NewSearch THEN EXIT SUB

' Display information using ShowExpr

ShowExpr Expr

END SUB

SUB PickChoice (M() AS CollectType, NumM, Label$, Pick%(), NumPick, ExactFlag, Expr AS ExpressionType) STATIC
SHARED Cnf AS Config

'PickChoice allows the user to choose one or more of the keywords from
' the list of keywords passed in M() as code #'s. It places the choices
' directly into the SubExpr.Phrase.

'NOTE: if there is only 1 choice passed to this routine, then it will
' automatically return that choice as the one chosen, without
' interacting with the user

' first set up the array of keywords from the list of code #'s

NewSearchFlag = FALSE

DO

  REDIM Word$(0 TO NumM)
  Word$(0) = Label$
  MaxWordLen = LEN(Word$(0)) + 2
  FOR i = 1 TO NumM

    Word$(i) = Dict$(M(i).Code)

    IF LEN(Word$(i)) > MaxWordLen THEN

      MaxWordLen = LEN(Word$(i))

    END IF

  NEXT

  FOR i = 1 TO NumM

    SpaceLen = MaxInt%(MaxWordLen, LEN(Label$) + 2) - LEN(Word$(i)) + 2
    Word$(i) = Word$(i) + SPACE$(SpaceLen)

  NEXT

  IF NumM = 1 AND (GlobalStatus = NewSearch OR GlobalStatus = AddWords) AND ExactFlag = TRUE THEN
    ' only one choice (during a Term Search or a Narrow Search)
    ' so return it automatically without a menu or a keypress
    NumPick = 1
    Pick%(1) = 1
    CALL Chime(6)

  ELSE

    ' push the window location to the lower right
    IF TermTypeMode$ = "LOCAL" THEN

      LOCATE 24 - MinInt%(NumM, 12) - 3, 80 - MaxWordLen - 7, 0
    ELSE

      LOCATE 24 - MinInt%(NumM, 12) - 4, 80 - MaxWordLen - 7, 0

    END IF

1 call e <List to get choices (total aaxioum of IS allowed in Sub£x=-esaicn) 
If TeraT /S eHodeS * "LOCAL" THEN 
IP LangS = "GERMAN" THEN 

CALL References("OANACH ENTER TASTE " ♦ CHRS(24> * CMRSC25) ♦ • PGUP, PGDN") 

ELSE 



ELSE 



CALL Rafertnces(CHRS(24) ♦ UP, ■ * CHRSC2S) ♦ ": DCWI, PGUP, PGDN, SPACE BAR; SELECT, ENTER: DONE. 

END IF 

IF LangS = "GERMAN" THEN 
^ ^ CALL References COANACH ENTER TASTE " *■ CH8S(24> * CMS (25) ♦ ■ PGUP, PGDN") 

CALL ReferencesCU: UP, X: DOWN, 6; PGUP, C: PGDN, SPACE BAR: SELECT, ENTER: DONE, ESC: CANCEL") 

END IF 

END IF 

i^?*^ -^rd^*' ™* **' Shclxprtty, HeaeKey, EhdXay. PgUpKey, PgDnKey, UpAr 

      IF Lang$ = "GERMAN" THEN

        i$ = Question$("NEUE SUCHE ? [J/N] ", "JN", "ACHTUNG !")
        IF i$ = "J" THEN NewSearchFlag = TRUE

      ELSE

        i$ = Question$("New Search? [Y/N] ", "YN", "WARNING!")
        IF i$ = "Y" THEN NewSearchFlag = TRUE

      END IF

    END IF

  END IF

LOOP UNTIL NumPick <> -10 OR NewSearchFlag

ERASE Word$
LOCATE , , 0
END SUB

SUB RankRecords (Recs() AS RecInfoType, NumFound%, Expr AS ExpressionType) STATIC

DIM AvgRankTot AS AvgRankType
DIM KYInfo AS KeyInfoLONG
KYInfoLen = LEN(KYInfo)
MaxWidth = INT(SQR(NumRecords&)) + 1

REDIM Record(1 TO MaxWidth, 1 TO MaxWidth) AS STRING * 1

' allocate an EMS array for the document ranking
DIM RankTot AS RankInfo, EL1 AS RankInfo, EL2 AS RankInfo
DIM BlankRankTot AS RankInfo
RankTotLEN = LEN(RankTot)

NumPages = NumRecords& * RankTotLEN \ SixteenK + 1
EmsAlloc NumPages, RankTotEMS, "Ranking Totals EMS Storage"

' allocate an EMS array for the summed values

DIM Value AS BitValueLong
DIM BlankValue AS BitValueLong
ValueLEN = LEN(Value)

NumPages = NumRecords& * ValueLEN \ SixteenK + 1
EmsAlloc NumPages, RankEMS, "Ranking EMS Storage"

' clear (to 0) the EMS memory we allocated
EMSPF = EmsGetPFSeg%
FOR i = 1 TO NumPages

  CALL EmsMapMem(RankEMS, 1, i) ' map logical page i to physical page 0

  CALL InitMem(EMSPF, 0, 8192, 0) ' clear physical page of memory to 0

NEXT

IF Lang$ = "GERMAN" THEN

  DispMsg "GEWICHTUNG DER GEFUNDENEN DOKUMENTE ", 0, 0

ELSE

  DispMsg "Ranking Documents ", 0, 0

END IF

ExprKeys$ = ""

QueryNum = 0

FOR i = 1 TO Expr.Num

  ExprKeys$ = ExprKeys$ + LEFT$(Expr.SubExpr(i).Phrase, Expr.SubExpr(i).Num * 2)

  QueryNum = QueryNum + Expr.SubExpr(i).Num

NEXT

BuildCombTable ("R") ' indicate RANKing is the caller

QueryLen = LEN(ExprKeys$) \ 2
REDIM Poly!(1 TO QueryLen)
FOR i = 1 TO QueryLen

  Poly!(i) = (ExprCodes(i).Poly / PolyAvg!) * .25

NEXT

NumRank& = 0: EmptyFlag = FALSE

FullSearch:

FOR i = 1 TO Expr.Num

  FOR j = 1 TO Expr.SubExpr(i).Num
    MemFlag = FALSE

    Code = Str2Code(Expr.SubExpr(i).Phrase, j)

    'EmsGet PolyValue, PolyLEN, CLNG(Code), PolySemyEMS '** THY 5/13/91
    FreeSpace& = FRE(-1)

    'if there is enough memory then read the whole block

    MemReq& = CLNG(KYIndx(Code).Num) * KYInfoLen

    IF FreeSpace& > MemReq& + 1024 THEN ' leave 1K free

      REDIM KYInfoArray(1 TO KYIndx(Code).Num) AS KeyInfoLONG

      'FGetA can read only 64K, so if we have more, read it in two steps
      IF MemReq& < ThirtyTwoK& * 2 THEN

        FSeek KYInvertDatFILE, (KYIndx(Code).Index - 1) * KYInfoLen

        FGetA2 KYInvertDatFILE, SEG KYInfoArray(1), CLNG(KYIndx(Code).Num) * KYInfoLen

      ELSE

        FSeek KYInvertDatFILE, (KYIndx(Code).Index - 1) * KYInfoLen

        FGetA2 KYInvertDatFILE, SEG KYInfoArray(1), CLNG(KYIndx(Code).Num \ 2) * KYInfoLen

        FSeek KYInvertDatFILE, (KYIndx(Code).Index - 1) * KYInfoLen + CLNG(KYIndx(Code).Num \ 2) * KYInfoLen

        FGetA2 KYInvertDatFILE, SEG KYInfoArray(KYIndx(Code).Num \ 2 + 1), CLNG(KYIndx(Code).Num \ 2 + KYIndx(

      END IF
      MemFlag = TRUE

    END IF

FOR Index& = 1 TO KYIndx(Code).Num
IF MemFlag THEN
LSET KYInfo = KYInfoArray(Index&)
ELSE
FGetRT KYInvertDatFILE, KYInfo, KYIndx(Code).Index + Index& - 1, KYInfoLen
END IF

Horiz = KYInfo.Rec MOD MaxWidth
Vert = INT(KYInfo.Rec / MaxWidth) + 1
IF Horiz = 0 THEN Horiz = MaxWidth: Vert = Vert - 1

' give a boost to the original words in expression
IF KeyInstr%(OriginalExpr$, MKI$(Code)) THEN
KYInfo.Value = KYInfo.Value * Boost!
END IF

' clear Value variable
Value = BlankValue
' check against first appearing only for non rare words
IF Gueryffua * 3 AND KYlndx(Code) .Nua > Liait * 5 AND NOT Bar:/? lag THEH 

'to pick up documents even there is only one keyword (for small set of documents)
'change this block to GOSUB KeepRec
'vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv

' check if this is first appearance
IF ASC(Record(Horiz, Vert)) = 0 THEN
Record(Horiz, Vert) = "1"
Value.Value = KYInfo.Value * PolyValue.Value
SetBit Value.Bit, j, 1
EmsSet Value, ValueLEN, KYInfo.Rec, RankEMS
ELSE
END IF

ELSE
GOSUB KeepRec
END IF

NEXT ' -- next document

IF Lang$ = "GERMAN" THEN
QPrintRC " ", 12, 55, -1
ELSE
QPrintRC " ", 12, 47, -1
END IF

IF MemFlag THEN ERASE KYInfoArray
NEXT ' -- next keyword in expression
NEXT ' -- next expression (always 1)

'-- if there are no documents which contain more than one word (very rare case)
IF NumRank& = 0 AND NOT EmptyFlag THEN
EmptyFlag = TRUE
GOTO FullSearch
END IF

IF NumRank& = 0 THEN 'no documents matched at all
Chime 6
NumFound = 0
DispMsg "", 0, 0
WaitSpace "Sorry, there are no documents matching your query. Please modify your search query. Press Space bar to continu
DispMsg "", 0, 0
EXIT SUB
END IF

FOR i& = 1 TO NumRank&
IF Lang$ = "GERMAN" THEN
QPrintRC STR$(i&), 12, 55, -1
ELSE
QPrintRC STR$(i&), 12, 47, -1
END IF

EmsGet RankTot, RankTotLEN, i&, RankTotEMS
EmsGet Value, ValueLEN, RankTot.Rec, RankEMS
RankTot.Bit = Value.Bit

RankTot.Value = CombeSumm(Value.Bit, Value.Value, "R", Poly!())
EmsSet RankTot, RankTotLEN, i&, RankTotEMS
NEXT

IF Lang$ = "GERMAN" THEN
QPrintRC " ", 12, 55, -1
ELSE
QPrintRC " ", 12, 47, -1
END IF

'-- sort in descending order by RankTot.Value
SortEMSRankInfo RankTotEMS, NumRank&

IF NumRank& > MaxShow THEN
NumFound = MaxShow
ELSE
NumFound = NumRank&
END IF

REDIM Recs(1 TO NumFound) AS RecInfoType
REDIM RecsDebug!(0 TO NumFound) '***rs debug

FOR i = 1 TO NumFound
EmsGet RankTot, RankTotLEN, CLNG(i), RankTotEMS
Recs(i).Rec = RankTot.Rec
RecsDebug!(i) = RankTot.Value '***rs debug
NEXT

EmsRelMem RankEMS

' Histogram information
'HistBars = MinInt%(NumFound, 25)

REDIM MatchRecVals(1 TO NumFound) AS RankInfo
FOR i = 1 TO NumFound
EmsGet RankTot, RankTotLEN, CLNG(i), RankTotEMS
MatchRecVals(i).Rec = RankTot.Rec
MatchRecVals(i).Value = RankTot.Value '**epg ^ .75
NEXT

FOpenAll "AvgRank", 2, 4, AvgRankFILE
IF AvgRankFILE < 0 THEN
CALL FCreate("AvgRank")
FOpenAll "AvgRank", 2, 4, AvgRankFILE
AvgRankTot.Num = 0
AvgRankTot.Value = 0
FOR i& = 1 TO 17
CALL FPutRT(AvgRankFILE, AvgRankTot, i&, LEN(AvgRankTot))
NEXT i&
END IF

FGetRT AvgRankFILE, AvgRankTot, CLNG(Expr.SubExpr(1).Num), LEN(AvgRankTot)

IF AvgRankTot.Num = 0 THEN
AvgRankTot.Value = MatchRecVals(1).Value
NewAvgFirstVal! = MatchRecVals(1).Value
ELSE
IF AvgRankTot.Num = 1 THEN

NevAvgfirstVal! = (AvgRankTot .Value • .5) * (Hat chReeVaUCJ). Value * .5) 

ELSE
NumDivNumPlus1! = AvgRankTot.Num / (AvgRankTot.Num + 1)
NewAvgFirstVal! = (AvgRankTot.Value * NumDivNumPlus1!) + (MatchRecVals(1).Value / (AvgRankTot.Num + 1))
END IF
END IF

AvgFirstVal! = AvgRankTot.Value
AvgRankTot.Value = NewAvgFirstVal!
AvgRankTot.Num = AvgRankTot.Num + 1

FPutRT AvgRankFILE, AvgRankTot, CLNG(Expr.SubExpr(1).Num), LEN(AvgRankTot)
FClose AvgRankFILE

EmsRelMem RankTotEMS
ERASE Record
DispMsg "", 0, 0

EXIT SUB
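
The AvgRank bookkeeping in the listing keeps, for each query length, a running mean of the score of the top-ranked document. The Python sketch below illustrates only the general update (old mean times n/(n+1), plus the new value divided by n+1). The function name `update_running_mean` and its arguments are hypothetical and do not appear in the listing, and the listing's special case for a stored count of 1 is not modeled here.

```python
def update_running_mean(avg, count, new_value):
    """Fold new_value into a mean of `count` earlier values.

    Returns the pair (new_mean, new_count). The names are illustrative
    assumptions, not identifiers from the BASIC listing.
    """
    if count == 0:
        # First observation: the mean is simply the value itself.
        return new_value, 1
    # General step: old mean weighted by n/(n+1), new value by 1/(n+1).
    weight_old = count / (count + 1)
    return avg * weight_old + new_value / (count + 1), count + 1


if __name__ == "__main__":
    mean, n = 0.0, 0
    for score in (10.0, 20.0, 30.0):
        mean, n = update_running_mean(mean, n, score)
    print(mean, n)
```

Storing only the mean and the count means the history of individual scores never has to be kept on disk, which fits the fixed-size records used for the AvgRank file.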



KeepRec:

IF Record(Horiz, Vert) < "2" THEN
' this is the second time we've seen this document, so add it to
' our list of found documents to be ranked
NumRank& = NumRank& + 1
RankTot = BlankRankTot ' -- clear the variable
RankTot.Rec = KYInfo.Rec
EmsSet RankTot, RankTotLEN, NumRank&, RankTotEMS
END IF

EmsGet Value, ValueLEN, KYInfo.Rec, RankEMS
Value.Value = Value.Value + KYInfo.Value * PolyValue.Value
SetBit Value.Bit, j, 1
EmsSet Value, ValueLEN, KYInfo.Rec, RankEMS
Record(Horiz, Vert) = "2"

RETURN
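
The ranking loop and the KeepRec routine accumulate, per document, the sum of keyword weight times polysemantic weight, record which query words matched in a bit field, and only list a document for ranking once it has been seen for a second keyword. A minimal Python sketch of that accumulation follows. It is an illustration under assumptions: `rank_documents`, `postings`, `poly` and `min_terms` are hypothetical names, and the final score here is the plain accumulated sum, whereas the listing passes the bit field and sum through a further combining function that is not reproduced.

```python
from collections import defaultdict


def rank_documents(postings, poly, min_terms=2):
    """Rank documents by accumulated weight (hypothetical sketch).

    postings: {term: [(doc_id, weight), ...]}  inverted-index entries
    poly:     {term: float}                    per-term polysemantic weight
    """
    score = defaultdict(float)  # accumulated weight per document
    bits = defaultdict(int)     # bit j set when query term j matched
    for j, (term, plist) in enumerate(postings.items()):
        for doc_id, weight in plist:
            score[doc_id] += weight * poly[term]
            bits[doc_id] |= 1 << j
    # Keep only documents matched by at least min_terms distinct terms.
    kept = [d for d in score if bin(bits[d]).count("1") >= min_terms]
    # Highest accumulated score first.
    return sorted(kept, key=lambda d: score[d], reverse=True)


if __name__ == "__main__":
    postings = {
        "a": [(1, 2.0), (2, 1.0)],
        "b": [(2, 3.0), (3, 5.0)],
        "c": [(1, 1.0), (2, 1.0)],
    }
    poly = {"a": 1.0, "b": 0.5, "c": 2.0}
    print(rank_documents(postings, poly))
```

Setting `min_terms=1` corresponds loosely to the fallback pass in the listing, which reruns the search without the second-appearance requirement when no document matched more than one word.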



SUB ReWriteHist (Curr)

' move the solid bar
IF TermTypeMode$ = "LOCAL" THEN
HistStart = 25
ELSE
HistStart = 24
END IF

FullFlag = FALSE
Ratio! = (MatchRecVals(1).Value ^ .5) / (AvgFirstVal! ^ .5)
HistFlag = FALSE
Start = 1

IF Curr = 1 THEN
St = 1
ELSE
St = Curr - 1
END IF

IF Curr = 25 THEN
Fin = 25
ELSE
Fin = Curr + 1
END IF
IF Fin > UBOUND(MatchRecVals) THEN Fin = UBOUND(MatchRecVals)
FOR i = St TO Fin 'NumBars
FOR j = 1 TO 16
IF (MatchRecVals(i).Value ^ .5 >= (.0625 * j * MatchRecVals(1).Value ^ .5)) THEN
QPrintRC HistChar$, HistStart - j, 3 * (i - Start + 1), NormAttr
QPrintRC HistChar$, HistStart - j, 3 * (i - Start + 1) - 1, NormAttr
IF i <> 1 THEN QPrintRC HistChar$, HistStart - j, 3 * (i - Start + 1) - 2, 0

IF j = 1 AND i <> Curr THEN
QPrintRC " ", HistStart - 1, 3 * (i - Start + 1) - 2, RevAttr
QPrintRC STR$(i), HistStart - 1, 3 * (i - Start + 1) - 2, RevAttr
IF i <> 1 THEN
QPrintRC HighChar$, HistStart - 1, 3 * (i - Start + 1) - 2, 0
ELSE
QPrintRC HighChar$, HistStart - 1, 3 * (i - Start + 1) - 2, (BG AND 1)
END IF
QPrintRC HistChar$, HistStart - 2, 3 * (i - Start + 1), NormAttr
QPrintRC HistChar$, HistStart - 2, 3 * (i - Start + 1) - 1, NormAttr
IF i <> 1 THEN QPrintRC HistChar$, HistStart - 2, 3 * (i - Start + 1) - 2, 0
END IF

IF i = Curr THEN
QPrintRC HighChar$, HistStart - j, 3 * (i - Start + 1), NormAttr
QPrintRC HighChar$, HistStart - j, 3 * (i - Start + 1) - 1, NormAttr
IF i <> 1 THEN QPrintRC HighChar$, HistStart - j, 3 * (i - Start + 1) - 2, 0
IF j = 1 THEN
QPrintRC HighChar$, HistStart - 2, 3 * (i - Start + 1), NormAttr
QPrintRC HighChar$, HistStart - 2, 3 * (i - Start + 1) - 1, NormAttr
IF i <> 1 THEN QPrintRC HighChar$, HistStart - 2, 3 * (i - Start + 1) - 2, 0
END IF
END IF
END IF

'if no bar, print nuobor 

IF <SCR(BatehStecVaU(i).Valua) < (.0625 * SQRCftatchReeValad). Value))) THEN 
IF i a Curr THEN 

QPrintRC HighCharS, HistStart - 1, 3 ♦ « - Start * 1), HoroAttr 
QPrintRC HighCharS, HistStart - 1, 3 * <i - Start * 1) - 1, NoroAttr 
IF i <> 1 THEN 

QPrintRC HighCharS, HistStart - 1. 3 * (i - Start + 1) - 2, 0 

ELSE . 

QPrintRC HighCharS, HistStart - 1, 3 * (i - Start ♦ 1) - 2, NoroAttr 
END IF r. 
QPrintRC HighCharS. HistStart - 2, 3 * <i - Start * 1), HoroAttr 
QPrintRC HighCharS, HistStart - 2, 3 * <i - Start * 1) - 1, NoroAttr 
IF i <> 1 THEN 

QPrintRC HighCharS, HistStart - 1, 3 * (i - Start ♦ 1) - 2, 0 

ELSE 

QPrintRC HighCharS, HistStart - 1, 3 • Ci - Start ♦ 1) - 2, NoroAttr 

END IF 



ELSE 



QPrintRC ■ ", HistStart - 1, 0 ♦ 3 * CI - Start + 1) - 2, RevAttr 
QPrintRC STR$<i), HistStart - 1, 0 ♦ 3 * (i - start ♦ 1) - 2, RevAttr 
IF 1 «> 1 THEN 

QPrintRC HighCharS, HistStart - 1, 0 • 3 * CI - Start + 1)"- 2 r 0 

ELSE 

QPrintRC HighCharS, HistStart - 1. 0 ♦ 3 * (I - Start ♦ 1) - 2, NoroAttr 

END IF 

QPrintRC HistCharS, HistStart - 2, 3 * (i - Start * 1), NoroAttr 
QPrintRC HistCharS, HistStart - 2. 3 * <i - Start ♦ 1) - 1, NoroAttr 
IF 1 <» 1 THEN QPrintRC HistCharS. HistStart - 2. 3 * <1 - Start ♦ 1) - 2, 0 



END IF 
END IF 

NEXT j 

NEXT l 

IF Ratio! < 1! THEN
FMsgRow = 24 - 16
ELSE
FMsgRow = 24 - 16! / Ratio!
END IF
FMsgCol = 1

IF TermTypeFlag THEN
ch = 196
ELSE
ch = ASC("-")
END IF

IF Lang$ = "GERMAN" THEN
QPrintRC STRING$(11, ch) + " HISTORISCHER DURCHSCHNITTSWERT FUER ERSTE DOKUMENTE " + STRING$(14, ch), FMsgRow, FMsgCol, NormAttr
ELSE
QPrintRC STRING$(11, ch) + " AVERAGE RELEVANCE OF FIRST DOCUMENT FOR SIMILAR QUERIES " + STRING$(10, ch), FMsgRow, FMsgCol, Nor
END IF

IF Curr > 1 THEN
IF MatchRecVals(Curr).Value < .6 * MatchRecVals(Curr - 1).Value THEN
IF Lang$ = "GERMAN" THEN
QPrintRC "Dokument" + STR$(Curr) + " vermutlich weniger relevant als vorhergehendes", 7, 14, -1
ELSE
QPrintRC "Document" + STR$(Curr) + " may be less relevant than previous document", 7, 15, -1
END IF
FullFlag = TRUE
END IF
END IF

IF Curr > 2 THEN
IF MatchRecVals(Curr).Value < .4 * MatchRecVals(Curr - 2).Value THEN
IF Lang$ = "GERMAN" THEN
QPrintRC "Dokument" + STR$(Curr) + " vermutlich weniger relevant als vorhergehendes ", 7, 14, -1
ELSE
QPrintRC "Document" + STR$(Curr) + " may be less relevant than previous documents", 7, 15, -1
END IF
FullFlag = TRUE
END IF
END IF

CALL HistMessage
DO
DO
ch$ = INKEY$
LOOP UNTIL ch$ <> ""
IF LEN(ch$) = 1 THEN
c = ASC(UCASE$(ch$))
ELSEIF LEN(ch$) = 2 THEN
c = ASC(RIGHT$(ch$, 1)) + 200
END IF

SELECT CASE c
CASE ESC, CR, RightArrowKey, LeftArrowKey, DirNumKey, ShowExprKey, HomeKey, EndKey, 275, 277, F3, F4, F5, F2, HM, EN
HistFlag = c
CASE NewSearchKey, F10
IF Lang$ = "GERMAN" THEN
I$ = Question$("NEUE SUCHE ? [J/N] ", "JN", "ACHTUNG !")
IF I$ = "J" THEN
HistFlag = c
END IF
ELSE
I$ = Question$("New Search? [Y/N]", "YN", "WARNING!")
IF I$ = "Y" THEN
HistFlag = c
END IF
END IF
END SELECT
LOOP UNTIL HistFlag

END SUB
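
The histogram routines draw each document's bar on a square-root scale relative to the top-ranked document: row j of 16 is filled when the square root of the document's value is at least 0.0625 * j times the square root of the first document's value. The Python sketch below computes the resulting bar height. The function name `bar_height` and its parameters are hypothetical, and the screen drawing itself is omitted.

```python
import math


def bar_height(value, top_value, rows=16):
    """Number of filled rows (0..rows) for a score on a square-root scale.

    Row j is filled when sqrt(value) >= (j / rows) * sqrt(top_value),
    so the height is floor(rows * sqrt(value / top_value)).
    Names are illustrative assumptions, not identifiers from the listing.
    """
    if top_value <= 0 or value <= 0:
        return 0
    ratio = math.sqrt(value) / math.sqrt(top_value)
    # The small epsilon guards against floating-point round-off at exact row boundaries.
    return min(rows, int(ratio * rows + 1e-9))


if __name__ == "__main__":
    top = 100.0
    for v in (100.0, 25.0, 1.0, 0.1):
        print(v, bar_height(v, top))
```

A height of zero corresponds to the "if no bar, print number" branch of the listing, where only the document number is shown. The square-root scale keeps low-scoring documents visible next to a dominant first hit.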

SUB ScrollHist (Curr, Direction$, MaxNum)
IF TermTypeMode$ = "LOCAL" THEN
HistStart = 25
ELSE
HistStart = 24
END IF

IF (Curr « naxNua AND NOT Last Flag) OS (Directions * "R") THEN 
IF Directions - "L" THEN 

ScrollL 7, 2, HistStart - 1, 80, 3, -1 
QPrintRC ■ % 9, 62, NoraAttr . _ 

QPrintRC " », 10, 62, NoraAttr 
FOR j - 1 TO 16 

• QPrintRC ■ 2* - j, 1, 21 

IF (HatcnftecVaLsCCurr - 1). Value * .5 >= (.0623 * j * HatchRecValsd). Value " .5)) THEN 
QPrintRC HistCharS, HistStart - J, 72, NoraAttr 
• QPrintRC HistCharS, HistStart - J, 71, NoraAttr 
QPrintRC HistCharS, HistStart - J, 70, 0 

END IF a 

NEXT 

FOR j * 1 TO 16 

IF (natchRecVaLs(Curr) .Value * .5 >• (.0625 * j * natchRecVa Lad). Value " .5)) THEN 

QPrintRC HighCharS, HistStart - j, 75, NomAttr 

QPrintRC HighCharS, HistStart - j, 7*, NoraAttr 

QPrintRC HighCharS, HistStart - j, 73, 0 
END IF 

NEXT 

• if no bar then print just mmber 

IF (HatchRecVals<Curr>. Value * .5 « C.062S * RatchRecValsd). Value * .5)) THEN 
QPrintRC HighCharS, HistStart - 1, 75, NoraAttr 
QPrintRC HighCharS, HistStart - 1, 74, NoraAttr 
QPrintRC HighCharS, HistStart - 1, 73, 0 
QPrintRC HighCharS, HistStart - 2, 75, NoraAttr 
QPrintRC HighCharS, HistStart - 2, 74, NoraAttr 
QPrintRC HighCharS, HistStart - 2, 73, 0 

END IF 

IF (itetchRacVaU(Curr).Value * .5 < (.0625 * HatchflecValsd). Value " .5)) THEN 
QPrintRC HistCharS, HistStart - 1, 72, NoraAttr 
QPrintRC HistCharS, HistStart - 1, 71. NoraAttr 
QPrintRC HistCharS. HistStart - 1, 70, O 
QPrintRC HistCharS, HistStart - 2, 72, NoraAttr 
QPrintRC HistCharS, HistStart - 2, 71, NoraAttr 
QPrintRC HistCharS, HistStart - 2. 70. 0 

END IF 

QPrintRC STRSCCurr - 1), HistStart - 1, 70, RevAttr 
QPrintRC " n . HistStart - 1, 70, 0 'NoraAttr 
IF Curr = KaxNun THEN 
Last Flag = TRUE 

^astFlag - FALSE 

END IF 
ELSE »OlrectionS»'R" 

Last Flag * FALSE 

ScrollR 7, 2, HistStart - 1, 80, 3, -1 . ■ 

CALL ClesrScr0(7, 76, HistStart - 1, 80, NoraAttr) » clear right portion 
FOR j » 1 TO 16 

' QPrintRC. • -\ HiatStert - J, 1, 21 

IF CnatchRecValeCCurr - HistStarO.Valua * .5 « (.0625 * j • HatehRecValsCI). Value * .5)) THEN 
QPrintRC HistCharS, HistStart - j, 3, NoraAttr 
QPrintRC HistCharS, HistStart - j. 2, NoraAttr 
QPrintRC HistCharS, HistStart - j, 1. 0 

ESQ If 

NEXT 

FOR i = 1 TO 16 

:? (HatehRecVaU(Curr - 230. Value * .5 *= (.0625 * j * HatchRecVsUd). Value * .5)) THEN 
QPrintRC HistCharS, HistStart - j. 4, 0 

END IF 

NEXT 

FOR j a 1 TO 16 

IF (RatchRecVals(Curr) .Value * .5 « (.0625 * J * nstshRecVelsCI). Value * .5)) THEN 
QPrintRC HighCharS, HistStart - j, 75, NoraAttr 
QPrintRC HighCharS, HistStart - j, 74, NoraAttr 
. QPrintRC HighCharS, HistStart - }, 73, 0 
END IF 

NEXT 

QPrintRC " HistStart -1,2, RevAttr 

QPrintRC LTRinS(STR$(Curr - HistStart)) , HistStart - 1, 2, RevAttr 
QPrintRC " ", HistStart - 1, 1, 0 'RevAttr
IF (Hatch»ecVals(Curr>. Value " .5 < (.0625 ♦ HatchRecVals(l). Value * .5)) THEN 
QPrintRC HighCharS, HistStart - 1, 75, Noraftttr 
QPrintRC HighCharS, HlstStart - 1, 74, NormAttr 
QPrintRC HighCharS, HlstStart - 1, 73, 0 
QPrintRC HighCharS, HlstStart - 2, 75, NormAttr • 
QPrintRC HighCharS, HlstStart - 2, 74, NoruAttr 
QPrintRC HighCharS, Hist Stare - 2, 73. 0 

0© IF 

IF (natchRecVals(Curr - HistStart). Value " .5 < (.0625 * HatchRecVaU(l). Value ' -5)) THEM 
QPrintRC HistChorS, HistStart - 2, 3. HoruAttr 
QPrintRC HiatCharS, HistStart - 2, 2, NoroAttr 

QPrintRC HistCharS, HistStart - 2, 1. 0 ' * 

QPrintRC STRSCCurr - HlstStart), HlstStart - 1, 1, RevAttr 

QPrintRC ■ *, HlstStart. 1, 0 "RewAttr ' 

END IF 

END IF 

IF LangS = "GERMAN" THEN 

^QPrintRC STRINGSC1Q, 32) ♦ • RELEVANI GRAPHIK CWEISSER BALKEN 1ST 6EG£HUA£RTtG=S flOaJHEHT)- ♦ STRIHGSdO, 32), 7, 1, -1 
^QPrintRC STRINGS (10, 32) ♦ • RELATIVE DOCUMENT RELEVANCE (S0U0 BAR IS CURRENT DOCUHENT)" ♦ STRIHGSdO, 32), 7, 1, -1 

IF Ratio! < 1! THEN
FMsgRow = 24 - 16
ELSE
FMsgRow = 24 - 16! / Ratio!
END IF
FMsgCol = 1
IF TermTypeFlag THEN
ch = 196
ELSE
ch = ASC("-")
END IF

IP LangS =» "6ERHAN" THEN 

^ QPrintRC STRINGS02, ch) ♦ ■ HISTORISCHER DURCHSCHNITTSUERT FUER ERS-= DOKUHERTE - + STRIKGS(12, Ch), FHsgRow, FHsgCol 
END IF 5TRWCS<10 ' Ch) * * AVERA6E RELEVANCE OF FIRST DOCUHENT FOR SIMILAR QUERIES • ♦ STRINGSOO, eh), FHsgRow, FHs 

END IF 

CALL HistHessage 
DO 

DO 



LOOP UNTIL chS <> 
* IF LEN(ehS) - 1 THEN 

C = ASC(UCASES(ehS)) 
ELSEIF LEN(ohS) a 2 THEN 

e « ASCC RIGHTS (chS, D) 4 200 

END IF 

SELECT CASE C 

" SE ESC H1«FL^ t t rr0WK * y ' LefMrrOUK ^' ShewExorlCey. IteoeKey, E-«ey, 275, 277, F3, FA, FS, F2, HH, EN 

CASE NewSearchXey, F10 
IF LangS = "GERNAN" THEN 

1S = CuestionSCNEUE SUCHE ? CJ/W B , "JM*, "ACHTUN6 »■) 

IF 1S a M" THEN 

HistFlag = c 

END IF 

ELSE 

1S « Questions ("New Search? tY/NJ", "w\ "WARNING!") 
IF IS =* "Y" THEN 
HistFLag = c 
• END IF 

END IF 
END SELECT 
LOOP UNTIL HistFlag 

END SUB 



FUNCTION SelectRelativesS (Expr AS ExpresaionType, CurrentSub, TopKeysO AS CoUectT/se, NuaTepKeys) STATIC 

»— current subexpression is always 1 
"CurrentSub « 1 

• we're now editing the current SubExpression 



» no fun search has been done yet for this nodi-Hod expression 

Expr. Raich = -1 
CALL SnouExpr(Expr) 

LOCATE 2, 30 

IF LangS = "GERttAN" THEN 

Labels = "LEERTASTE fuer ein Oder oehrere Uorte' 
ELSE 

Labels = -Select One or Here Words" 
END IF 

IF Expr.SubExprOKNuo » 15 THEN EXIT FUNCTION 

REDIH PickedXd TO 15 - Expr.Sub£xpr(1).NuaJ . _ 

CALL PiekO»ice<To?*eys(), NuaToplceys, Labels, PiekedO, KuaPicked, ExaetFtag, Expr) 
IF NuaPicked ° -10 THEN • F10 pressed 
Global Status ■ He* Search 
ELSEIF NumPicked > 0 THEN

• store the selected words directly in the current subexpression 

add nuaber of w>rds selected to nusber of words in expression 
NuaExprUords ? NLscxpruords + NusPieked 

RED1H PRESERVE Exp r Codes C1 TO ftjaExprVords) AS CodaPolyType 

• FOR 1 = NusExprWords - NumPtdced + 1 TO HunExorUords 

Expr Codes CD.Codt = TopXeys(Plcked(i - NuaExprUords ♦ HuaPi eked)). Coco 
EesGetlSL ExprCodes(1).Pely, LEN(ExprCodes(1).Poly), ExprCodes(i).Coee. PolySeayEHS 
■? IF IHSTR((RIGHTS(xJ, LEN(xS) - 5)). * "> <> 0 THEN 

' ? Exprcodes(NuaExprUcrds) . Poly = ExprCodesCifcjoExpruords) . Poly * 1.4 

'? END IF 
NEXT 

SortT ExprCodesC1>, NuaExprWords, Descend, lEN(ExprCodesCI)), 2, -3 

Expr.SufeExpr( Cur rent Sub). Hue a NueExprWords 

FOR k « 1 TO NuaExorUords 

Code2Str Ex?r.Sub£xpr< Cur rentSub). Phrase, k, ExprCodes<k).COde 

NEXT 

set the status to adding Swaps 
GlobalStatus " SWAPS 



'■— return TRUE if words -ere selected 
SelectRelativesX « J .-^ektd » 0 

END FUNCTION 

SUB ShowHist (NuaShow, C^rr, Expr AS Expression Type) STATIC 

' — Shows the histograa 
FullFlag = FALSE 

IF TeroTypenodeJ = "LOOL" THEN 
IF LangS = "ger.ta.h- then 

Infoline* = CHRSC26) * ":Naesch Dolt - ♦ CHRS(27) ♦ "cVorher Dok =SC:«enue ENTER :JCurrfassung - 

else . 

Infolir.eS = CHRS(27) + Prev, ■ ♦ CHRS(26) ♦ »: Next. NOME: First. END: last, ENTER: Highlights, ESC: Go Back" 



ELSE 



END IF 
Hi at Start » 



IF LangS = "SEXKAV THEN 

InfolineS - CHRS(LeftKey) ♦ ": Prev. " ♦ CHRS(RightKey) ♦ Next, " ♦ CHRS(HoaeKey) ♦ ■: First. ■ ♦ CHRS(EndKay) + " 

ELSE 

END IF InfoUn *' ° CHRS(LeftArrowKey) ♦ Prev, • ♦ OfRSCRIghtArrawiCey) ♦ ": Next, " ♦ CHRS(HoaeKey) ♦ First, • ♦ CHRSCEn 
. Hi st Start - 24 

END IF 

CALL References (InfolineS) 

Ratio! * SQR(RatchReeVaLstl). Value) / SQRlAvgFiratVal!) 
HistFlag «• FALSE 
BEDIM AScr2X(1 TO 2000) 

CALL ClearSCrOtf, 1, HistStart - 1, 80. NoraAttr)' clear botte* portion 



' IF LangS = *6ERHAN* t*=s 

• RelevanceStrt = -RELEVAMZ GRAPH IK (UE1SSER BALKEH 1ST GEGENUAERTI6ES DOKUHENT)" 
'ELSE 

• RelevanceStrS « "RELATIVE OOdflENT RELEVANCE (SOLID BAR IS CURRENT DOCUMENT)" 
•END IF 

'RelStrLen = LEN(RelevaneeStrS) 
•spcLeft = (80 - RelStrLen) \ 2 
'SpcRight * 80 - RelStrLen - spcLeft 

'RelevaitceStrS ^ SpACESC SpcLeft) ♦ Relevances^ 

•QPrintRC RelevanceStrS. 24, 1, RevAttr • " 

IF CtifT <s 25 THEN 

.Finish s HinlntUNuaShew, 25) 
Start » 1 

ELSE 

Finish ~ Curr 
Start = Finish - 24 

END IF 

FOR 1 ■ Start TO Finish 'Nurtars 
FOR j « 1 TO 16 

IF (S&RlhatchRecVals(i).Value) »= (.0625 • j * SQR(HatchReeVals(1). Value))) THEN 
SPrintRC HistCharS, HistStart - J, 3 * (1 - Start ♦ 1), NoraAttr 
. QPrintRC HistCherS, HistStart - j, 3 * <1 - Start ♦ 1) - 1, aoraAttr 

IF i <> Start THEN QPrintRC NistCharS, HistStart - J, 3 * (1 - Start ♦ 1) - 2, 0 

if j ■ 1 AND i *> Curr THEN 

QPrintRC " \ HistStart - 1, 3 * <i - Start ♦ 1) - 2, RewAttr 
QPrintRC STRS(i), HistStart - 1. 3 ♦ Ci - Start ♦ 1) - 2. RevAttr 
IF i «> Start THEN 

QPrintRC • % HistStart - 1. 3 * (i - start • 1) - 2. 0 

ELSE 

QPrintRC HtghCharS, HistStart - 1, 3 * ti - Start ♦ 1) - 2, 1 

END IF 

QPrintRC HistCharS, HistStart - 2, 3 • (i - Start * 1), NoraAttr 

QPrintRC HistCharS, HistStart - 2, 3 * (i - Start ♦ T) - 1, NoraAttr 

IF i «> Start THEN QPrintRC HistCharS, HistStart - 2, 3 * Ci - Start ♦ 1) - 2, 0 

SV> IF 




1= i = Curr THEN 
QPrintRC HighOiarS, HistStart - j, 3 * (i - Start ♦ 1), NcraAttr 
QPrintRC HighCharS, HistStart - j, 3 * Ci - Start ♦ 1) - 1, NoniAttr 
If i <> start THEN QPrintRC HighCharS, HistStart - j, 3 • Ci - Start ♦ 1) - 2, 0 
IF j = 1 THEN 

QPrintRC HighCharS, HistStart - 2, 3 * (i - Start - 1), NoraAttr 

QPrintRC HighCharS, HistStart - Z, 3 * ii - Start * 1) - 1, NoraAttr 

IF i «> Start THEN QPrintRC HighCharS r HistStart - 2, 3 ♦ Ci - Start ♦ 1) - 2, 0 

END IF 
=\3 IF 
END IF - 

'if nc bar, print nunber 

IF CSQR(Hat<:hRecVals(i).value) < (.0625 * SGR(HatchRecVals<1>.VaUe))> THEN 
IF i « Curr THEN 

QPrintRC HighCharS, HistStart - 1, 3 * (i - Start * 1), NoraAttr 
QPrintRC HighCharS, HistStart - 1, 3 * Ci - Start ♦ 1) - 1, NoraAttr 
IF i «> Start THEN QPrintRC HighCharS, HistStart - 1, 3 * <1 - Start ♦ 1) - 2. 0 
QPrintRC HighCharS, HistStart - 2. 3 * Ci - Start * 1), NoraAttr 
QPrintRC HighCharS, HistStart - 2, 3 • CI - Start ♦ 1) - 1, NoraAttr 
IF i <> Start THEN QPrintRC HighCharS, HistStart - 2, 3 * CI - Start ♦ 1) - 2, 0 
ELSE 

QPrintRC " ", HistStart - 1. 0 ♦ 3 * ti - Start ♦ 1) - 2, RevAttr 

QPrintRC STRSd), HistStart -1,0*3* CI - Start * 1) - 2, RevAttr 

IF 1 <> Start THEN QPrintRC HighCharS, HistStart - 1,. 0 * 3 * Ci - Start + 1) - 2, 0 

QPrintRC HlstCharS, HistStart - 2, 3 * Ci - Start ♦ 1), NoroAttr 

QPrintRC HlstCharS, HistStart - 2, 3 * Ci - Sxart «• 1) - 1, NoroAttr 

IF 1 <> Start THEN QPrintRC HlstCharS, HistStart - 2, 3 * Ci - Start ♦ 1) - 2, 0 

END IF 

END 

NEXT j 

NEXT l 

IF Ratio! < 1! THEN
FMsgRow = 24 - 16
ELSE
FMsgRow = 24 - 16! / Ratio!
END IF
FMsgCol = 1
IF TermTypeFlag THEN
ch = 196
ELSE
ch = ASC("-")
END IF

IF LanoS = "GERMAN" Th"n - 

QPrintRC STRINGS ("2, ch) ♦ " HISTORISCHER CURCHSCHNITTSUERT FUER ERSTE DOKLFENTE " ♦ STRIN6SC12, ch), FHsgRou, FHsgCol, NoraAt 

ELSE 

QPrintRC STRINGS (10, ch) ♦ ■ AVERAGE RELEVANCE OF FIRST DOCUMENT FOR SIMILAR OUERIES " ♦ STRINGS (10, eh), FHsgRov, FHsgCol, No 

END IF 

IF Curr 1 ^ H ™! |t€VaUCeiuPr ,.y aLII0 < . 6 . natehRecVaU(Curr - D.Value THEN 



IF LanoS 3 "GERMAN" THEN . . . _ _ m . „ 

QPrintRC "Doltunenf ♦ STRSCCurr) ♦ " veniutlieh weniger relsvant aU vcrhergehences", 7, 14, -1 
-icb 

QPrintRC "Docuaenf ♦ STRSCCurr) ♦ - nay be Less relevant than previous document", 7, 15, -1 

* END IF 

Full Flag = TRUE 

END IF 
END IF 

IF Curr * F Z Ha ^J RecVaLs(Curr) . uolue < .4 * HatchRetVaUCCurr - 2). Value THEN 

1F Un9 J Pr s ntRC^k^te ♦ STRS(Curr) *■ " veroutlich uenlger relevant als ujmergehendes", 7, 14, -1 
ELSE QPrintRC "Doeuaent" ♦ STRS(Curr) ♦ ■ say be less relevant than previous docunents", 7 r 15, -1 
END If 

FullFlag - TRUE 
END IF 

END IF 

CALL HistHeaaege 

-DO - 

DO 

chS a INXEYS 
LOOP UNTIL ehS <> "" 
IF LEHCchS) a 1 THEN 

c = ASC(UCASESCchS)) 
ELSEIF LENCchS) = 2 THEN 

C = ASCCRIGHTSCchS, D) > 200 

END IF 

C^"scfcR? RigntArrowfey, LeftArrovXey, OirNuaXey. ShowExprKey, HoaeKey, Bx*ey, 273, 277, F3, F4, FS. F2, HN, EN 

Hist Flag = c 
CASE NeuSeerchKey, F10 
IF LanoS » "GERMAN" THEN 

» « QuestionSC»NEUE SUCHE ? CJ/W ", "JN«, "ACHTUNB I") 
IF iS « - J" THEN 

Hi stf teg * c 

END IF 



ELSE 



iS ■ QuestionSC"Neu Search? CT/N3"; *YN", "WARNING!") 
IF IS • *V* THEN 
KistFLag * c 
END IF 



END IF 
END SELECT 
LOOP UNTIL HistFleg 

■CALL NScmRcstCI. 1, 25, 80, SEG AScr2XO)> 
ERASE AScr2X 

END sua 
SUB SortEMSRankInfo (Handle%, NumEls&) STATIC

' sorts RankInfo types by Value in Descending order

DIM Array1 AS RankInfo, Array2 AS RankInfo
ArrayLEN = LEN(Array1)

Span& = NumEls& \ 2
DO WHILE Span& > 0
FOR i& = Span& TO NumEls& - 1
j& = i& - Span& + 1
FOR j& = (i& - Span& + 1) TO 1 STEP -Span&
EmsGet Array1, ArrayLEN, j&, Handle%
EmsGet Array2, ArrayLEN, j& + Span&, Handle%
IF Array2.Value <= Array1.Value THEN EXIT FOR
' Swap array elements that are out of order.
EmsSet Array1, ArrayLEN, j& + Span&, Handle%
EmsSet Array2, ArrayLEN, j&, Handle%
NEXT j&
NEXT i&
Span& = Span& \ 2
LOOP
END SUB
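
The sort routine above is a Shell sort with a gap that halves on each pass, ordering records so that the largest value comes first. The same control flow can be sketched in Python on an ordinary list. This is an illustration only: `shell_sort_desc` and the `key` parameter are hypothetical names, and the expanded-memory reads and writes of the listing are replaced by in-memory swaps.

```python
def shell_sort_desc(items, key=lambda x: x):
    """Sort a list in place, largest key first, with a halving-gap Shell sort."""
    span = len(items) // 2
    while span > 0:
        for i in range(span, len(items)):
            j = i - span
            # Walk back in steps of `span`, swapping while the later
            # element is larger than the earlier one.
            while j >= 0 and key(items[j + span]) > key(items[j]):
                items[j], items[j + span] = items[j + span], items[j]
                j -= span
        span //= 2
    return items


if __name__ == "__main__":
    ranks = [(7, 0.4), (3, 0.9), (5, 0.1), (9, 0.6)]  # (record, value) pairs
    print(shell_sort_desc(ranks, key=lambda r: r[1]))
```

Because each comparison touches only two records at a time, the approach suits data held outside conventional memory, where records are fetched and stored individually.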

SUB SortSwapEMS (Handle%, NumEls%) STATIC

' sorts Synth BitValue types by Value in Descending order

DIM Array1 AS BitValue, Array2 AS BitValue
ArrayLEN = LEN(Array1)

Span% = NumEls% \ 2
DO WHILE Span% > 0
FOR i% = Span% TO NumEls% - 1
j% = i% - Span% + 1
FOR j% = (i% - Span% + 1) TO 1 STEP -Span%
EmsGet Array1, ArrayLEN, j%, Handle%
EmsGet Array2, ArrayLEN, j% + Span%, Handle%
IF Array2.Value <= Array1.Value THEN EXIT FOR
' Swap array elements that are out of order.
EmsSet Array1, ArrayLEN, j% + Span%, Handle%
EmsSet Array2, ArrayLEN, j%, Handle%
NEXT j%
NEXT i%
Span% = Span% \ 2
LOOP

END SUB

SUB TermMatch (Termx$, Expr AS ExpressionType, m() AS CollectType, NumM, Repeate, ExactFlag) STATIC

' collect code #'s from the Dictionary which match Term$
' uses the same technique as the CheckKey in the AIM program
' which uses the table for the first 3 chars and does a
' "reverse" match, i.e., matching all words in the dictionary
' to the Term$

' Note that the arrays necessary for these routines are globally
' shared, so do not need to be passed

' Also Note: The SINGKEY.STR and COMBKEY.STR used below are not
' in memory, but are accessed directly from disk using FGetRT

DIM WordCompare AS SingKeyType
DIM CombKeyTemp AS CombKeyType
DIM SingKeyTemp AS SingKeyType
DIM DictTemp AS DictType
Repeate = FALSE
IF Lang$ = "ENGLISH" THEN
Term$ = Termx$
ELSE
Term$ = LCASE$(Termx$)
END IF

ExactFlag = FALSE

' replace only SOME punctuation with spaces
New$ = " "

OLdS » V/.-OOO" 

FOR j = 1 TO LEN(Old$)
CALL ReplaceChar(Term$, MID$(Old$, j, 1), New$)
NEXT
NumM = 0

SELECT CASE ASC(Term$)
CASE AscA TO AscZ: SearchStep = 1
CASE AscUpperA TO AscUpperZ:
IF ASC(MID$(Term$, 2, 1)) < AscUpperA OR ASC(MID$(Term$, 2, 1)) > AscUpperZ THEN
SearchStep = 2
END IF
CASE ELSE: SearchStep = 3
END SELECT

IF INSTR(Term$, " ") <> 0 GOTO FindCombKey
OriginTerm$ = Term$

' check if the first 3 letters of the word return
' a valid range from the 3-dimensional table array
IF FirstLast%(LCASE$(Term$), First, Last, Sing) THEN 'yes, search thru range
AddSearchAgain:
FOR j = First TO Last
CALL FGetRT(SingKeyFILE, SingKeyTemp, CLNG(j), LEN(SingKeyTemp))
CurrKey$ = RTRIM$(SingKeyTemp.Str)
Exact = FALSE
DO WHILE RIGHT$(CurrKey$, 1) = "/"
Exact = TRUE
CurrKey$ = LEFT$(CurrKey$, LEN(CurrKey$) - 1)
LOOP

' compare the single keyword [CurrKey$/SingKeyTemp.Str]
' against the search word [Term$]
Match = (Term$ = LEFT$(CurrKey$, LEN(Term$)))
IF NOT Exact AND NOT Match THEN
Match = (CurrKey$ = LEFT$(Term$, LEN(CurrKey$)))
END IF

IF Match THEN GOSUB SaveSingStr
NEXT ' key in range

IF NOT Hatch THEN 

^ri^TeV^-I^SCLEWTennS. D) ♦ LCASESCRI6HTS(TersS, LOKTemS) - 1» 
SearchStep = SearchStep + 1 
CASE 2: TeroS = UCASESlTeraS) 

SearchStep » SearchStep + 1 
CASE ELSE: GOTO AddNextKey 
END SELECT 
GOTO AddSearchAgaln 
ELSE GDS-3 SaveSingStr 

IF TernS a CurrKeyS THEN Exact f Lag = TRUE 

END IP 

AddNextKey: 

END If* the range was valid 

F0 *CALL FGe^RKSingK>v?ILE, SingKeyTeap, CLHG(j), LEN(Sir**eyTeop)) 
CurrKeyS » RtRINS(SingtoyTeflp.Str) 
00 WHILE RI6HTSCCurrKeyS, 1) ° ' _ ^ 4 ^ . 

CurrtCeyS = LEFTS (CurrKeyS. LEH(CurrKeyS) - 1) 
LOOP 

IF CurrKeyS a LEFTS 'TeroS, LENCCurrKeyS) > THEN 

GOSUe SaveSingStr 
ENO IF 
NEXT 

Done searching tor single keys, go into FindtoebKey routine new 
'TeroS * TeroS ♦ - " 

EXIT SUB 

I _^-^Q-r~~- "■" f^ nm " 

FindCoobKey: 

TeroS ' LCASES(TeraS) 

Mr It's a valid range, tnen check w^ 3 ^ 1 ^ 
IF FtrstLestXCTeroS, First, Last, Coob) THEN 
FOR j - First TO Last 

CALL F6etRT(CoabKeyFILE, CoebKeyTetip, CLKG(j), LEN(CoobKeyTe»p)) . 

^TiftCoun^IW^ - ') ♦ V count number of wcra In coabined key 

CfifflbKeyTev.Str = LCJUES(CertKeyTeap.Str) 

CALLEx-ract ( CcnbKeyTeiip. Str , ■ 1. Strt, SLen) 'extract first wore 
S^TJiDlSSK^cep.Str, Strt, Slen)' of coabined keyworc 

IF RIGHTS C CurrKeyS, 1) i VI THEN _. 

Exact • TRUE 

IP RIGHTS < CurrKeyS, 2) = "//" THEN * 

Slen « Slen - 2» account for // at end 

ELSE 

Slen a Slen - V account f or / at end 



ENS IF 

CurrKeyS * LEFTS (CurrKeyS, Slen) 
E*ect « FALSE 



ELSE 
END IF 

• get first word to < — , — - 
CALL Ext-actCTeraS, " \ 1, Strt, DLen) 
DocworoS • RlOSCTereS, Strt, DLen) 
•coapare first word of coabined key C CurrKeyS] 
'aqainsx the first word of the search word CDocwordS] 
Hatch = CDoeyordS * LEFTS (CurrKeyS, LEH(DocttordS))) 
IF NOT Hatch THEN 

IF Exact THEN ' check for *exact* oateh 
Hatch - {CurrKeyS = OocUordS) 

ELSE 

Hatch ? (CurrKeyS = LEFTS (BocUcrdS, Slen)) 

END IF 

END IF 

• TO match, skip to next coabined key.ln the First-Last range 
IF NOT hatch GOTO NextCoabKay 

- • if Ter*S was a single keyword, then skip over continued catching 
IF RIGHT$(Term$, 1) = " " GOTO CombKeyMatched

• continue Batching the rest of the words in the ccabined key 
1 exiting out as soon as there's a non-natch 

FOR k = 1 TO Words - V rusher of words left in ccabined key 

' extract the next word fron the current combined keyword (j) 
CALL Extract<Cc*bKeyTeap.Str, ■ k ♦ 1, Strt, Slen) 
CurrKeyS - HIDS<ConbKeyTesp.Str, Strt, Slen) 
IF RIGHTSCCurrKeyS, 1) • »/- THEM ■ reaove / at end of word 
Exact = TRUE 

CurrKeyS « LEFTSCCurrKeyS, Slen - 1) 
Slen « Slen - 1' account for / at end 

ELSE 

Exact ■ FALSE 

ESD IF 

.' get next word to eoapare 

CALL ExtractCTeraS, ■ ■„ k ♦ 1, Strt, OLan) 

DocvordS 3 HID$(TaraS, Strt, OUn) 

IF CurrKeyS « "a** THBI ' special processing for 3 wildcard 
IF INSTRUtLlstS, •/" ♦ OocUordS ♦ "/"> THEN 

Hatch s TRUE* the word was 1n the a 11st, sc continue 

ELSE 

Hatch a FALSE 

■ END IF 

ELSE 

Hatch * COocwordS = LEFTSCCurrKeyS, LEHCSocVordS))) 
IF HOT Hatch THEN 

IF Exact THEN * Check for *exect* match 

Natch o (CurrKeyS = DocvordS) 
ELSE ' wildcard aatch, only eoapare ff of chers in CurrKeyS 
Natch = (CurrKeyS • LETOCDocwordS, Slen)) 

END IF 

END IF 

END IF 

IF NOT Hatch THEN EXIT FOR 
NEXT 1 word in current coebined keyword 

CosbKeyHatched: 

IF Hatch THEN G0SU8 SaveCoabStr 

NextCoBbXey: 
NEXT 

END IF 1 Table range was valid 

•TeraS o RTRIHSCTera*) 
'FOR j - 1 TO NmCoabKty 

• CALL FGatRTCCoabKeyFILE, CoabKeyTeap, CLNSCJ), LSKCoobXeyTeap) ) 

• CurrKeyS = LEFTS ( CoabKeyTeap. Str, IHSTR ( CoabKeyTeap. St r , " •) - 1) 

• DO WHILE RIGHTSCCurrKeyS. 1) = V 
■' CurrKeyS a LEFTSCCurrKeyS, LEN(CunKeyS) - 1) 

LOOP 

' IF CurrKeyS = LEFTS CTeraS, LEN(CurrKeyS)) THEN 
> 60SUB SaveCoabStr 

• END IF 
•NEXT 

EXIT sua 



' check if this Code was already added by a previous match to a synonym

SaveCombStr:
'-- don't need the dict lookup, the code is in the .STR structure already
'** 4/26/91 4:50p THY
'DictTemp.Str = CombKeyTemp.Str
TempCode = CombKeyTemp.Code
GOTO CheckStr

SaveSingStr:
'-- don't need the dict lookup, the code is in the .STR structure already
'** 4/26/91 4:50p THY
'DictTemp.Str = SingKeyTemp.Str
TempCode = SingKeyTemp.Code

CheckStr:
'-- don't need the dict lookup, the code is in the .STR structure already
'** 4/26/91 4:50p THY
'TempCode = DictSrch%(DictTemp)

IF KYIndx(TempCode).Num > 0 THEN ' it appeared in this database
Found = FALSE
FOR i = 1 TO NumM
IF m(i).Code = TempCode THEN Found = TRUE: EXIT FOR
NEXT
IF NOT Found THEN 'then add the single keyword to the list of matches
' check to see if it's already in the expression
IF KeyInstr%(Expr.SubExpr(1).Phrase, MKI$(TempCode)) = 0 THEN
NumM = NumM + 1
REDIM PRESERVE m(1 TO NumM) AS CollectType
m(NumM).Code = TempCode
ELSE
Repeate = TRUE
END IF
END IF
END IF
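
The term-matching routine compares a search word with each dictionary key in both directions: the word may be a prefix of the key, or, unless the key is flagged as exact, the key may be a stem of the word. The Python sketch below illustrates that rule for a single word. It is a sketch under assumptions: `key_matches` and `matching_keys` are hypothetical names, and the trailing "/" as the exact-match marker is inferred from the listing's comments about removing "/" at the end of a word.

```python
def key_matches(term, key):
    """Return True when a dictionary key matches a search term.

    A trailing "/" on the key marks it as exact: such a key may not
    act as a stem of a longer search term (assumed marker convention).
    """
    exact = key.endswith("/")
    stem = key.rstrip("/")
    if stem.startswith(term):
        # The search term is a prefix of the dictionary key.
        return True
    if exact:
        return False
    # Reverse match: the dictionary key is a stem of the search term.
    return term.startswith(stem)


def matching_keys(term, keys):
    """Collect every key that matches the term, without duplicates, in input order."""
    found = []
    for key in keys:
        if key_matches(term, key) and key not in found:
            found.append(key)
    return found


if __name__ == "__main__":
    dictionary = ["comput", "computer/", "computing", "dog"]
    print(matching_keys("computers", dictionary))
```

The duplicate check in `matching_keys` mirrors the listing's scan of the collected codes before a new code is appended to the list of matches.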



What is claimed is:

1. A method of indexing and retrieving documents,
said method using a digital computer system having a
central processing unit, a memory, a display screen, a
keyboard, and a large capacity file system, said method
comprising the steps of:

(a) storing in said memory a vocabulary of terms,
each term consisting of one or more words, and for
each term an associated term-code;

(b) storing on said file system a collection of documents
each with an associated unique document-number;

(c) creating index files which contain for each said
term-code in (a)
(i) the set of document-numbers in (b) such that the
corresponding documents contain the corresponding
term; and
(ii) for each said document-identifying-number in
(i) the frequency-in-document of the corresponding
term which is the number of times that said
term appears in the corresponding document;

(d) creating a weight-in-document file which contains
for each document-number in (c)(i) the weight-in-document
of the corresponding term which is calculated
using the frequency-in-document in (c)(ii),
the number of document-numbers in (c)(i), and the
total number of terms in (a) which are in the corresponding
document (counted multiple times);

(e) creating a frequent-companion file which contains
for each occurring term-code in (a) a ranked set of
pairs of numbers where each pair consists of a first
element term-code and a second element companion-percentage,
where the companion-percentage
is calculated by summing the weight-in-document
values of said first element term-code over documents
that contain both the term corresponding to
said first element term-code and the term corresponding
to said occurring term-code and then
dividing by the sum over all documents of the
weight-in-document of said occurring term-code;

(f) creating a relative file which contains for each
occurring term-code in (a) a ranked set of pairs of
numbers where each pair consists of a first element
relative term-code and a second element
relative-percentage, where the relative-percentage is
calculated by taking a weighted average of the
companion-percentage of said first element term-code
calculated in step (e) and the companion-percentage
of said occurring term-code that was calculated in
step (e) when said first element term-code was the
occurring term-code and said occurring term-code
was the first element term-code;

(g) creating a polysemantic file which contains for
each occurring term-code in (a), a polysemantic
weight which is calculated using the number of sets
of pairs in the relative file created in step (f) that
said occurring term-code appears in, the number of
document-numbers for which the weight-in-document
of said occurring term-code calculated in step
(d) is greater than some threshold value, and the
averages for several values of N of the first N
relative-percentages of said occurring term-code
calculated and ranked in step (f);
(h) accepting a query consisting of a sequence of 
words entered by a user using said keyboard and 
creating a parsed-query table of term-codes which 
consist of the term-codes in said vocabulary that 
are associated with the terms that are contained in 
said query; 

(i) creating a temporary swap table of pairs of first
element term-codes and corresponding second
element summed-relative-percentages consisting of
those relative term-codes created in step (f) where
said corresponding second element
summed-relative-percentages are the sum, over all said
occurring term-codes that are in said parsed-query table,
of the relative percentages of said first element
term-codes;

(j) creating a modified swap table by modifying said
second element summed-relative-percentages created
in step (i) by multiplying them by a function of
the polysemantic weight of the corresponding first
element term-codes;

(k) sorting said modified swap table by said modified
summed-relative-percentages in descending order;

(l) displaying on said display the terms corresponding
to the term-codes of said modified swap table;

(m) accepting user keypresses or other actions which
identify one or more of the terms displayed in step
(l) and adding the corresponding term-codes to the
parsed-query-table;

(n) repeating steps (i) through (m) as many times as
the user indicates by his input;

(o) accepting an input from the user indicating a com- 
mand to retrieve documents; 

(p) creating a temporary rank table of pairs of first
element document-numbers and corresponding
second element summed-document-weight×poly
values which pairs comprise those document-numbers
for which any of the term-codes that are in
said parsed-query table have weight-in-document
above a threshold value, and
summed-document-weight×poly values which are the sums, over all
term-codes in said parsed-query table, of a function
of the polysemantic weight of the term-code and
the weight-in-document of the term-code;

(r) creating a sorted rank table by sorting said tempo- 
rary rank table by the value of the second elements 
of the pairs in descending order; 

(s) displaying on the display screen some portion of
the document corresponding to the first document
number in the sorted rank table and some indication
of the corresponding
summed-document-weight×poly value;
(t) displaying other documents corresponding to
other document-numbers in the sorted rank table in
response to inputs from the user.
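
The companion-percentage of step (e) and the relative-percentage of step (f) can be illustrated with a minimal Python sketch. The function names, the dictionary layout of the weight-in-document data, and the equal 0.5/0.5 weighting in the average are assumptions made only for illustration; the claim does not fix any of them.

```python
def companion_percentage(weights, first, occurring):
    """Step (e): weight of `first` summed over documents containing both
    terms, divided by the weight of `occurring` summed over all documents.

    `weights` maps term-code -> {document-number: weight-in-document}.
    """
    w_first = weights.get(first, {})
    w_occurring = weights.get(occurring, {})
    total = sum(w_occurring.values())
    if total == 0:
        return 0.0
    shared = sum(w for doc, w in w_first.items() if doc in w_occurring)
    return shared / total


def relative_percentage(weights, first, occurring, alpha=0.5):
    """Step (f): weighted average of the two companion percentages.

    `alpha` is a hypothetical weighting; the claim only says "weighted average".
    """
    forward = companion_percentage(weights, first, occurring)
    backward = companion_percentage(weights, occurring, first)
    return alpha * forward + (1 - alpha) * backward


# Toy data: two terms, three documents.
weights = {
    "contract": {1: 0.5, 2: 0.25},
    "breach": {2: 0.5, 3: 0.5},
}
cp = companion_percentage(weights, "contract", "breach")
rp = relative_percentage(weights, "contract", "breach")
```

Ranking the pairs for each occurring term-code then amounts to sorting the candidate first-element term-codes by these values in descending order.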

2. A method as in claim 1 wherein additional steps
(j)(1) and (p)(1) are carried out after steps (j) and (p),
respectively, to implement the soft boolean connector
algorithm which consists of the following steps:

(A) creating a table of relative penalties for each pair
of said term-codes in said parsed-query table where
said relative penalty is a function of the relative
percentage corresponding to the two term-codes of
said pair, the number of documents that each of the
term-codes of the pair are contained in with a
document-weight above a threshold, and the average
over all terms of the number of documents that
the term is contained in with a document-weight
above said threshold;

(B) modifying said relative penalties by taking the
minimum of the relative penalty and some maximum
value which depends on the number of terms
in the parsed-query table;

(C) summing said modified relative penalties to
produce a sum of relative penalties;

(D) modifying said sum of relative penalties by taking
the minimum of said sum and some maximum sum
value which depends on the number of terms in the
parsed-query table to produce a modified sum of
penalties;

(E) summing some function of the polysemantic
weights of the term-codes in the parsed-query table
that are either relatives of a potential SWAPS term
(j1) or are contained in a document (p1) to produce
a number of hits value;

(F) calculating some function of the number of hits
value and the modified sum of penalties value to
produce a power value;

(G) raising a number approximately equal to 2 to the
power value to produce an adjust value;

(H) multiplying either the modified summed relative
percentages calculated in step (j) or the summed
document weight × poly values calculated in step
(p) by the adjust value.

3. A method as in claim 1 where the formula for
calculating the weight-in-document in step (d) is:

[Formula garbled in the source. Recoverable terms:
Weight(Word), Log2(FreqInDoc + 1), a second Log2 factor,
TotDocs × 1.5, DocsWithWord + 3, Log2(2 + TotalKeywordsInDoc).]

4. A method as in claim 1 where the formula for
calculating the polysemantic weight in step (g) is:

[Formula garbled in the source. Recoverable terms:
PolyValue, Avg3, Avg6, TotRelVal, DocFreq.]

5. A method as in claim 1 where the function in step
(j) is the identity function.

6. A method as in claim 1 where the function in step
(p) is the identity function.

* * * * *




05/25/2004, EAST Version: 1.4.1 



United States Patent [19]

Turtle et al.

US005488725A

[11] Patent Number: 5,488,725
[45] Date of Patent: Jan. 30, 1996



[54] SYSTEM OF DOCUMENT 

REPRESENTATION RETRIEVAL BY 
SUCCESSIVE ITERATED PROBABILITY 
SAMPLING 

[75] Inventors: Howard R. Turtle, Rosemount; Gerald
J. Morton, South St. Paul; F. Kinley
Larntz, Shoreview, all of Minn.

[73] Assignee: West Publishing Company, Eagan, 
Minn. 



[21] Appl. No.: 39,757 

[22] Filed: Mar. 30, 1993 

Related U.S. Application Data 

[63] Continuation-in-part of Ser. No. 773,101, Oct. 8, 1991, Pat.
No. 5,265,065.

[51] Int. Cl.6 G06F 17/30

[52] U.S. Cl. 395/600; 364/419.19; 364/DIG. 1;
364/282.1; 364/282.3
[58] Field of Search 395/600; 364/419.19

[56] References Cited 

U.S. PATENT DOCUMENTS 

4,384,329   5/1983  Rosenbaum et al.   364/300
4,422,158  12/1983  Galie              395/400
4,554,631  11/1985  Reddington         364/2835
4,843,389   6/1989  Lisle et al.       341/106
4,870,568   9/1989  Kahle et al.       395/600
5,109,509   4/1992  Katayama et al.    395/600
5,159,667  10/1992  Borrey et al.      395/148
5,220,625   6/1993  Hatakeyama et al.  382/54
5,263,159  11/1993  Mitsui             395/600
5,265,065  11/1993  Turtle             395/600
5,278,980   1/1994  Pedersen et al.    395/600
5,297,042   3/1994  Morita             364/419.19
5,301,109   4/1994  Landauer et al.    364/419.19
5,321,833   6/1994  Chang et al.       395/600
5,325,298   6/1994  Gallant            364/419.19
5,335,345   8/1994  Frieder et al.     395/600
5,418,948   5/1995  Turtle             395/600



OTHER PUBLICATIONS 

M. E. Smith, "Aspects of the P-Norm Model of Information 
Retrieval: Syntactic Query Generation, Efficiency, and 
Theoretical Properties" Ph. D Thesis, Department of Com- 
puter Science, Cornell University, Ithaca, NY., TR 90-1128 
(May, 1990). pp. 116-120. 

Buckley et al., "Optimization of Inverted Vector Searches", 
Proceedings of the Association for Computing Machinery 
(SIGIR85), 1985, pp. 97-110. 

Turtle et al., "Uncertainty in Information Retrieval Sys- 
tems", Provisional Proceedings on Uncertainty Manage- 
ment in Information Systems, sponsored by the National 
Science Foundation and ESPIRIT, Majorca, Spain, Sep. 23, 
1992, pp. 111-137. 

Croft et al., "A Retrieval Model Incorporating Hypertext
Links", Hypertext '89 Proceedings, Association for
Computer Machinery, pp. 213-224 (Nov. 1989).
Turtle et al., "Inference Networks for Document Retrieval",

(List continued on next page.) 

Primary Examiner— Thomas G. Black 
Assistant Examiner— Wayne Amsbury 
Attorney, Agent, or Firm— Kinney & Lange 



[57] ABSTRACT



An information retrieval system based on probabilities that 
documents meet information needs. The frequency of occur- 
rence of a representation in a collection of documents is 
estimated by identifying the frequency of occurrence of the 
representation in a sample of documents and calculating the 
difference between the maximum and minimum probable 
frequencies of occurrence of the representation in the col- 
lection. If the difference does not exceed a limit, a midpoint 
of the maximum and minimum probable frequencies is the 
estimated frequency of occurrence of the representation. 

Document distribution probabilities are optimized and prob- 
ability thresholds are established for the identification of 
documents. An initial probability threshold is established 
and is adjusted as the probabilities are scored for documents 
in samples. The document result list is iteratively adjusted 
through the samples. 

46 Claims, 13 Drawing Sheets 
SYSTEM OF DOCUMENT
REPRESENTATION RETRIEVAL BY
SUCCESSIVE ITERATED PROBABILITY
SAMPLING

CROSS REFERENCE TO RELATED 
APPLICATIONS 

This application is a continuation-in-part of application
Ser. No. 07/773,101 filed Oct. 8, 1991, U.S. Pat. No.
5,265,065.

BACKGROUND OF THE INVENTION 

This invention relates to information retrieval, and
particularly to document retrieval from a computer database
using probability techniques. More particularly, the
invention concerns a method and apparatus for establishing
probability thresholds in probabilistic information retrieval
systems and for estimating representation frequencies in
document databases for representations having no
pre-computed frequency.

There are, in theory, two categories of information
retrieval systems: algebraic systems and probabilistic
systems. Algebraic systems logically match terms and their
positions in a stored information (such as a document) to
terms in a query; Boolean systems are examples of algebraic
systems. Probabilistic systems match representations
(concepts) in a stored information to concepts in a query to
retrieve information based on probabilities rather than
algebraic or Boolean logic.

Presently, document retrieval is most commonly
performed through use of Boolean search queries to search the
texts of documents in the database. These retrieval systems
specify strategies for evaluating documents with respect to
a given query by logically comparing search queries to
document texts. One of the problems associated with text
searching is that for a single natural language description of
an information need, different Boolean researchers will
formulate different Boolean queries to represent that need.
Because the queries are different, different documents will
be retrieved for each search.

Another difficulty with Boolean systems is that all
documents meeting the query are retrieved, regardless of number.
If an unmanageable number of documents are retrieved, the
searcher must reformulate the search query to more
narrowly define the information need, thereby narrowing the
retrieved documents to a more manageable number.
However, in narrowing the search, the researcher risks missing
relevant documents partially meeting the information need.
Moreover, Boolean systems will not retrieve documents
only partially meeting the query, which themselves are often
important secondary documents to the query.

More recently, probabilistic systems employing hypertext
databases have been developed which emphasize flexible
organizations of multimedia "nodes" through connections
made with user-specified links and interfaces which
facilitate browsing in the network. Early networks employed
query-based retrieval strategies to form a ranked list of
candidate "starting points" for hypertext browsing. Some
systems employed feedback during browsing to modify the
initial query and to locate additional starting points.
Network structures employing hypertext databases have used
automatically and manually generated links between
documents and the concepts or terms that are used to represent
their content. For example, "document clustering" employs
links between documents that are automatically generated
by comparing similarities of content. Another technique is
"citations" wherein documents are linked by comparing
similar citations in them. "Term clustering" and
"manually-generated thesauri" provide links between terms, but these
have not been altogether suitable for document searching on
a reliable basis.

Deductive databases have been developed employing 
facts about the nodes, and current links between the nodes. 
A simple query in a deductive database, where N is the only
free variable in formula W, is of the form {N|W(N)}, which
is read as "Retrieve all nodes N such that W(N) can be
shown to be true in the current database." However,
deductive databases have not been successful in information
retrieval. Particularly, uncertainty associated with natural 
language affects the deductive database, including the facts, 
the rules, and the query. For example, a specific concept may 
not be an accurate description of a particular node; some 
rules may be more certain than others; and some parts of a 
query may be more important than others. For a more 
complete description of deductive databases, see Croft et al. 
"A Retrieval Model for Incorporating Hypertext links", 
Hypertext '89 Proceedings, pp 213-224, November 1989 
(Association for Computing Machinery), incorporated 
herein by reference. 

A Bayesian network is a probabilistic network which 
employs nodes to represent the document and the query. If 
a proposition represented by a parent node directly implies 
the proposition represented by a child node, an implication 
line is drawn between the two nodes. If-then rules of
Bayesian networks are interpreted as conditional
probabilities. Thus, a rule A→B is interpreted as a probability P(B|A),
and the line connecting A with B is logically labeled with a
matrix that specifies P(B|A) for all possible combinations of
values of the two nodes. The set of matrices pointing to a
node characterizes the dependence relationship between that 
node and the nodes representing propositions naming it as a 
consequence. For a given set of prior probabilities for roots 
of the network, the compiled network is used to compute the 
probability or degree of belief associated with the remaining 
nodes. 
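
As a small worked illustration of the link matrix described above, the following Python sketch computes the belief in a child node B from the prior probability of a single parent A. The function name and the numeric values are hypothetical; only the interpretation of the rule A→B as a matrix of conditional probabilities P(B|A) comes from the text.

```python
def child_belief(prior_a_true, link):
    """Return P(B = true) given P(A = true) and a 2x2 link matrix.

    link[a][b] holds P(B = b | A = a), with index 0 = false and 1 = true.
    """
    p_a = [1.0 - prior_a_true, prior_a_true]
    # Marginalize over both possible values of the parent node.
    return sum(p_a[a] * link[a][1] for a in (0, 1))


# Illustrative numbers only: each row is a distribution over B.
link_ab = [
    [0.9, 0.1],  # A false: B is rarely true
    [0.2, 0.8],  # A true: B is usually true
]
belief_b = child_belief(0.3, link_ab)  # 0.7 * 0.1 + 0.3 * 0.8
```

With several parents the same marginalization runs over every combination of parent values, which is why the matrix must specify a probability for all such combinations.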

An inference network is one which is based on a plausible 
or non-deductive inference. One such network employs a 
Bayesian network, described by Turtle et al. in "Inference
Networks for Document Retrieval", SIGIR 90, pp. 1-24, Sep.
1990 (Association for Computing Machinery), incorporated
herein by reference. The Bayesian inference network
described in the Turtle et al. article comprises a document
network and a query network. The document network rep- 
resents the document collection and employs document 
nodes, text representation nodes and content representation 
nodes. A document node corresponds to abstract documents 
rather than their specific representations, whereas a text 
representation node corresponds to a specific text
representation of the document. A set of content representation nodes
corresponds to a single representation technique which has 
been applied to the documents of the database. 

The query network of the Bayesian inference network 
described in the Turtle et al. article employs an information
node identifying the information need, and a plurality of 
concept nodes corresponding to the concepts that express 
that information need. A plurality of intermediate query 
nodes may also be employed where multiple queries are 
used to express the information requirement.

The Bayesian inference network described in the Turtle et 
al. article has been quite successful for small, general 
purpose databases. However, it has been difficult to formulate
the query network to develop nodes which conform to
the document network nodes. More particularly, the inference
network described in the Turtle et al. article did not use
domain-specific knowledge bases to recognize phrases, such 
as specialized, professional terms, like jargon traditionally 
associated with specific professions, such as law or medi- 
cine. 

One important aspect to probabilistic retrieval networks, 
such as a Bayesian inference network, is the identification of
the frequency of occurrence of a representation in each 
document and in the entire document collection. A repre- 
sentation that occurs frequently in a document is more likely 
to be a good descriptor of that document's content. A
representation that occurs infrequently in the collection is 
more likely to be a good discriminator than one that occurs 
in many documents. Consequently, when creating a database 
for a probabilistic network, care is taken to identify the 
representations (content concepts) in the documents, as well 
as their frequencies. However, it is not always possible to 
identify certain representations (such as phrases, proximities 
and thesaurus or synonym classes) or their frequency when 
creating the database. More particularly, phrases are usually 
comprised of multiple words which themselves are indi- 
vidual concepts or representations. The concept or repre- 
sentation of a phrase might be different from the concepts or 
representations of the individual words forming the phrase. 
For example, the phrase "independent contractor" is a
different concept than either of the constituent words
"independent" and "contractor". Since it is not always possible to
identify all possible phrases, or their frequency of occur- 
rence, during creation of the database, the use of phrases as 
a matching term in probabilistic networks has not been 
altogether successful. Proximities (such as citations) and
thesaurus and synonym classes have likewise not been
successful identifiers because of the inability to identify all
synonyms, proximities and thesaurus classes during creation 
of the database or to pre-assign their frequencies. 
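
The two intuitions in this paragraph (a representation frequent within a document describes it well; a representation rare in the collection discriminates well) can be made concrete with a short Python sketch. The formulas used here, a frequency normalized by the document's maximum term frequency and a logarithmic inverse document frequency, are a common textbook weighting assumed purely for illustration; the passage itself states no formulas, and the function names are hypothetical.

```python
import math


def term_frequency_weight(freq_in_doc, max_freq_in_doc):
    """Within-document weight: higher when the representation is frequent
    relative to the most frequent term of the same document."""
    if max_freq_in_doc <= 0:
        return 0.0
    return freq_in_doc / max_freq_in_doc


def inverse_document_frequency(docs_with_concept, collection_size):
    """Collection-level weight: higher when few documents contain the concept."""
    if docs_with_concept <= 0:
        raise ValueError("concept must occur in at least one document")
    return math.log(collection_size / docs_with_concept)


# A concept found in 10 of 1000 documents discriminates better than one
# found in 500 of 1000 documents.
rare = inverse_document_frequency(10, 1000)
common = inverse_document_frequency(500, 1000)
```

Both quantities require frequency counts, which is exactly what is unavailable for phrases, proximities and synonym classes that were not identified when the database was built.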

Techniques have been developed to identify phrases, 
synonyms, proximities and thesaurus classes as concepts in 
the query, and to find phrases, synonyms, proximities and 
thesaurus classes as representations in the documents. How- 
ever, no satisfactory technique exists for identifying the 
frequencies of occurrence of representations in the docu- 
ments and in the collection when the document collection is 
large and the frequencies of occurrence are not included in 
the database. 

Another difficulty with probabilistic networks is that for 
large databases, for example databases containing about 
one-half million documents or more, the processing 
resources required to evaluate a query have been too great to 
be commercially feasible. More particularly, probabilistic 
networks required that all representations for all documents 
in the collection containing at least one query term must be 
examined against all of the concepts in the query. Hence, 
probabilistic networks required extensive computing 
resources. While such computing resources might be rea- 
sonable for small collections of documents, they were not 
for large databases. There is, accordingly, a need to improve 
the processing of probabilistic networks to more efficiently 
employ the processing resources. 

For a more general discussion concerning inference net- 
works, reference may be made to Probabilistic Reasoning in 
Intelligent Systems: Networks of Plausible Inference by J. 
Pearl, published by Morgan Kaufmann Publishers, Inc., San 
Mateo, Calif., 1988, and to Probabilistic Reasoning in
Expert Systems by R. E. Neapolitan, John Wiley & Sons,
New York, N.Y., 1990. 

GLOSSARY

As used herein, the following alpha-numeric characters
refer to the following terms:

Character      Term

a, b, A, B     Term or word in a query or document.
c              Root or concept node in query network.
d_i, d_j       Document node in a document network.
D              Number of documents to be selected or identified
               to result list.
(illegible)    Concept frequency in collection (frequency, or
               number, of documents in collection containing
               concept i).
(illegible)    Frequency of concept i in document j.
(illegible)    Probable maximum frequency of documents in
               collection containing specific concept
               (maximum bound).
(illegible)    Probable minimum frequency of documents in
               collection containing specific concept.
g              Number of documents in collection between
               documents containing a representation (gaps).
I              Information need in query network.
i              Concept (an item of an information need).
idf_i          Inverse document frequency for concept i.
idf_max        Probable maximum inverse document frequency
               for concept i.
idf_min        Probable minimum inverse document frequency
               for concept i.
j              Specific document (d_j).
(illegible)    The maximum frequency for any term occurring
               in document j.
(illegible)    Number of documents in sample containing
               selected representation.
(illegible)    Number of documents in collection.
P              Parent nodes to child node Q.
q_i, q_k       Query nodes in query network.
Q              Child node to parent nodes P.
r              Leaf or concept representation nodes in
               document network.
(illegible)    A calculated number equal to greater of x/nj
               and sd.
sd             Standard deviation.
(illegible)    Sum of squares of gaps g.
t              Interior text nodes in document network.
(illegible)    Probability estimate based on the frequency
               that concept i appears in document j
               (based on ...).
(illegible)    Number of terms in query.
(illegible)    Number of duplicate terms removed from query.
(illegible)    Term weights for parent nodes where w 8 is
               maximum.
(illegible)    Maximum term weight for child node Q,
               0 ≤ w t ≤ 1.
(illegible)    Number of documents in sample.

SUMMARY OF THE INVENTION

According to one aspect of the present invention the
frequency of occurrence of a selected representation in a
collection of documents is estimated by identifying the
frequency of occurrence of the representation in a sample of
documents selected from the collection. Probable maximum
and probable minimum frequencies of occurrence of the
representation in the entire collection are calculated, and the
midpoint of the probable maximum and minimum
frequencies is selected.

The estimated frequency of occurrence of the selected
representation is set equal to the selected midpoint when the
calculated difference between the probable maximum and
minimum frequencies does not exceed a preselected limit. If
the preselected limit is exceeded, the sample of documents
is adjusted to include additional documents from the
collection, the sampling and calculating being repeated until the
calculated difference between the probable maximum and
minimum frequencies is within the preselected limit.
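
The sample-and-widen loop described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the summary does not say how the probable maximum and minimum are computed, so a normal-approximation interval on the sample proportion is assumed here, and the function name, the `batch`, `z` and `seed` parameters, and the use of a shuffled index list are all hypothetical.

```python
import math
import random


def estimate_frequency(collection, contains, limit, batch=100, z=1.96, seed=0):
    """Estimate how many documents contain a representation.

    The sample grows by `batch` documents until the spread between the
    probable maximum and minimum frequencies is within `limit`; the
    midpoint is then returned together with the sample size used.
    """
    rng = random.Random(seed)
    order = list(range(len(collection)))
    rng.shuffle(order)
    n_total = len(collection)
    sampled = hits = 0
    while sampled < n_total:
        take = order[sampled:sampled + batch]
        hits += sum(1 for i in take if contains(collection[i]))
        sampled += len(take)
        p = hits / sampled
        # Assumed bound: normal approximation to the sampling error of p.
        # (This bound degenerates when p is 0 or 1 in a small sample.)
        half = z * math.sqrt(p * (1 - p) / sampled)
        f_min = max(0.0, p - half) * n_total
        f_max = min(1.0, p + half) * n_total
        if f_max - f_min <= limit:
            return (f_min + f_max) / 2, sampled
    # Whole collection examined: the count is exact.
    return float(hits), sampled


# Toy collection: every fourth "document" contains the representation.
docs = [i % 4 == 0 for i in range(10000)]
estimate, used = estimate_frequency(docs, lambda d: d, limit=500)
exact, scanned = estimate_frequency(docs, lambda d: d, limit=0)
```

A looser limit stops the loop after a small sample, while a limit of zero forces the whole collection to be examined, which is the cost the estimate is meant to avoid.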

The advantage provided by estimation of the frequency of
representations such as phrases, synonyms, proximities and
thesaurus classes is that the representations can be identified
from the query itself and the frequencies can be accurately
estimated without significantly affecting processing
resources or the search results. Consequently,
representations such as phrases, synonyms, proximities and thesaurus
classes can be employed as representation concepts, even in
large databases.

According to another aspect of the invention a sample is
selected and the one document with the highest probability
of meeting the information need defined by the query is
identified from the sample of documents from the collection.
In one form of the invention, a probability threshold is set
equal to the probability that the selected document meets the
information need. When a predetermined number of
additional documents of the collection are identified as having a
probability of meeting the information need which is greater
than the probability threshold, the threshold is reset to the
probability of the selected document with the lowest
calculated probability. Thereafter, as documents with higher
probabilities are identified, the documents with the lowest
probabilities are correspondingly removed. Upon completion of
the search, the predetermined number of documents
identified as having the highest probabilities are retrieved,
preferably in probability order.
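
The later phase of this procedure, in which the threshold tracks the lowest probability among the documents currently selected, can be sketched with a bounded min-heap. This Python sketch is a simplification under stated assumptions: it omits the initial threshold taken from the best document of a first sample, and the function name and the `(doc_id, probability)` input format are hypothetical.

```python
import heapq


def top_documents(scored_docs, d):
    """Keep the `d` highest-probability documents seen so far.

    `scored_docs` yields (doc_id, probability) pairs. Once `d` documents
    are held, the threshold is the lowest held probability, and a new
    document is admitted only if it exceeds that threshold.
    """
    if d <= 0:
        return []
    heap = []  # min-heap of (probability, doc_id)
    threshold = 0.0
    for doc_id, prob in scored_docs:
        if len(heap) < d:
            heapq.heappush(heap, (prob, doc_id))
        elif prob > threshold:
            # Drop the lowest-probability document and admit the new one.
            heapq.heapreplace(heap, (prob, doc_id))
        if len(heap) == d:
            threshold = heap[0][0]
    # Return in probability order, highest first.
    return sorted(heap, reverse=True)


scores = [("a", 0.2), ("b", 0.9), ("c", 0.5), ("d", 0.7), ("e", 0.1)]
result = top_documents(scores, 3)
```

Because documents scoring below the current threshold are rejected with a single comparison, the threshold rises as the search proceeds and progressively fewer documents need to be stored or re-ranked.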

In another form of the invention, instead of employing the
probability of the document selected from the first sample as
a probability threshold, successive samples are iteratively
selected, each successive sample containing documents
different from each previous sample. Up to a predetermined
number of documents having the highest probabilities of
meeting the information need are identified during each
iteration, the documents being selected from a group
consisting of the sample of documents selected for the
respective iteration and the documents identified during the
previous iteration. Preferably, the predetermined number is
equal to the number of the respective iteration, so there are
as many iterations as there are documents to be selected.
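
The iterated form can be sketched in Python as below, assuming the preferred case in which iteration k keeps up to k documents. The function name, the representation of samples as lists, and the `score` callback standing in for the probability of meeting the information need are hypothetical.

```python
def iterative_selection(samples, score):
    """Select documents over successive disjoint samples.

    On iteration k (counting from 1), the k highest-scoring documents are
    kept from the union of the current sample and the documents kept on
    the previous iteration.
    """
    selected = []
    for k, sample in enumerate(samples, start=1):
        pool = selected + list(sample)
        pool.sort(key=score, reverse=True)
        selected = pool[:k]
    return selected


# Toy run: each "document" is represented directly by its probability.
samples = [[0.3, 0.1], [0.8, 0.2], [0.5, 0.9]]
chosen = iterative_selection(samples, score=lambda p: p)
```

In this run the document scoring 0.3 survives the first two iterations and is displaced in the third, showing how the result list is adjusted as further samples are examined.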



BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram representation of a Bayesian
inference network with which the present invention is used.

FIG. 2 is a block diagram representation of a simplified 
Bayesian inference network as in FIG. 1. 

FIG. 3 is a block diagram of a computer system for 
carrying out the invention. 

FIGS. 4A and 4B, taken together, are a flowchart and 
example illustrating the steps of creating a search query for 
a probabilistic network. 

FIG. 5 is a flowchart and example of the steps for
determining a key number for inclusion in the search query
described in connection with FIG. 4.

FIGS. 6A-6D are block diagram representations
illustrating different techniques for handling phrases.

FIGS. 7A and 7B, taken together, are a detailed flowchart 
identifying the steps for calculating the estimated inverse 
document frequency for a specific concept according to the 
present invention. 

FIG. 8 is a flowchart illustrating the manner by which
partial phrases are handled in a document retrieval system.

FIG. 9 is a graph illustrating the principles of certain 
aspects of threshold estimating according to the present 
invention. 

FIG. 10 is a detailed flowchart identifying the steps for
setting probability thresholds and optimizing document
retrieval according to the present invention.

FIG. 11 is a detailed flowchart illustrating the maximum
score optimization techniques according to the present 
invention. 

FIG. 12 is a detailed flowchart of the process for creating 
the query network for a probabilistic information retrieval 
network. 

FIG. 13 is a detailed flowchart of the process for evalu- 
ating a document network used with the query network 
shown in FIG. 12. 

DETAILED DESCRIPTION OF THE PREFERRED 
EMBODIMENT 

The Probability Network 

Inference probability networks employ a predictive prob- 
ability scheme in which parent nodes provide support for 
their children. Thus, the degree to which belief exists in a
proposition depends on the degree to which belief exists in
the propositions which potentially caused it. This is distinct
from a diagnostic probability scheme in which the children 
provide support for their parents, that is belief in the 
potential causes of a proposition increases with belief in the 
proposition. In either case, the propagation of probabilities 
through the network is done using information passed 
between adjacent nodes. 

FIG. 1 illustrates a Bayesian inference network as 
described in the aforementioned Turtle et al. article. The 
Bayesian network shown in FIG. 1 is a directed acyclic 
dependency graph in which nodes represent propositional
variables or constraints and the arcs represent dependence 
relations between propositions. An arc between nodes rep- 
resents that the parent node "causes" or implies the propo- 
sition represented by the child node. The child node contains 
a link matrix or tensor which specifies the probability that 
the child node is caused by any combination of the parent 
nodes. Where a node has multiple parents, the link matrix 
specifies the dependence of that child node on the set of 
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Each document node has a prior probability associated 
with it that describes the probability of observing that 
document. The document node probability will be equal to 
l/(coHection size) and will be small for most document 
collections. Each text node contains a specification of its 
dependence upon its parent By assumption, this dependence 
is complete (t, is true) when its parent document is observed 
(d, is true). Each representation node contains a specification 
of the conditional probability associated with the node given 
its set of parent text nodes. The representation node incor- 
porates the effect of any indexing weights (for example, term 
frequency in each parent text) or term weights (inverse 
document frequency) associated with the concept. 

The query network 12 is an "inverted" directed acyclic 
graph with a single node I which corresponds to an infor- 
mation need. The root nodes c lf c^, C3, . . . c m are the 
primitive concept nodes used to express the information 
requirement A query concept node, c, contains the specifi- 
cation of the probabilistic dependence of the query concept 
on its set of parent representation content nodes, r. The query 
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parents and characterizes the dependence relationship 
between the node and all nodes representing its potential 
causes. Thus, for all nodes there exists an estimate of the 
probability that the node takes on a value given any set of 
values for its parent nodes. If a node a has a set of parents 5 
xx={Viy • • • Pn)> me estimated probabilities PCalpj, . . . pj 
are determined. 

The inference network is graphically illustrated in FIG. 1
and consists of two component networks: a document
network 10 and a query network 12. The document network 10
consists of document nodes d_1, d_2, . . . d_i-1, d_i, interior text
representation nodes t_1, t_2, . . . t_j-1, t_j, and leaf nodes r_1, r_2,
r_3, . . . r_k. The document nodes d correspond to abstract
documents rather than their physical representations. The
interior nodes t are text representation nodes which
correspond to specific text representations within a document.
The present invention will be described in connection with
the text content of documents, but it is understood that the
network can support document nodes with multiple children
representing additional component types, such as audio,
video, etc. Similarly, while a single text may be shared by
more than one document, such as journal articles that appear
in both serial issue and reprint collections, and parent/
divisional patent specifications, the present invention shall
be described in connection with a single text for each
document. Therefore, for simplicity, the present invention
shall assume a one-to-one correspondence between
documents and texts.

The leaf nodes r are content representation nodes. There
are several subsets of content representation nodes r_1, r_2, r_3,
. . . r_k, each corresponding to a single representation
technique which has been applied to the document texts. If a
document collection has been indexed employing automatic
phrase extraction and manually assigned index terms, then
the set of representation nodes will consist of distinct subsets
or content representation types with disjoint domains. For
example, if the phrase "independent contractor" has been
extracted and "independent contractor" has been manually
assigned as an index term, then two content representation
nodes with distinct meanings will be created, one
corresponding to the event that "independent contractor" has been
automatically extracted from the subset of the collection,
and the other corresponding to the event that "independent
contractor" has been manually assigned to a subset of the
collection. As will become clear hereinafter, some concept
representation nodes may be created based on the content of
the query network.



concept nodes c_1 . . . c_m define the mapping between the
concepts used to represent the document collection and the
concepts that make up the queries. A single concept node
may have more than one parent representation node. For
example, concept node c_j may represent the query concept
"independent contractor" and have as its parents
representation nodes r_2 and r_3 which correspond to "independent
contractor" as a phrase and as a manually assigned term.

The nodes q are query nodes representing distinct query
representations corresponding to the event that the
individual query representation is satisfied. Each query node
contains a specification of the query on the query concept it
contains. The intermediate query nodes are used in those
cases where multiple query representations express the
information need I.

As shown in FIG. 1, there is a one-to-one correspondence
between document nodes, d, and text nodes, t. Consequently,
the network representation of FIG. 1 may be
diagrammatically reduced so that the document nodes d_1, d_2, . . . d_i-1, d_i
are parents to the representation nodes r_1, r_2, r_3, . . . r_k. In
practice, it is possible to further reduce the network of FIG.
1 due to an assumed one-to-one correspondence between the
representation nodes r_1, r_2, r_3, . . . r_k, and the concept nodes
c_1, c_2, c_3, . . . c_m. The simplified inference network is
illustrated in FIG. 2 and is more particularly described in the
article by Turtle et al., "Efficient Probabilistic Inference for
Text Retrieval," RIAO 91 Conference Proceedings, pp.
644-661, April, 1991 (Recherche d'Information Assistée par
Ordinateur, Universitat Autònoma de Barcelona, Spain),
which article is herein incorporated by reference.

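As described above, each child node carries a probability
that the child node is caused by the parent node. The
estimates of the dependence of a child node Q on its set of
parents, P_1, P_2, . . . P_n, are encoded using the following
expressions:

$$P_{or}(Q=\mathrm{true}) = 1 - (1-p_1)(1-p_2)\cdots(1-p_n) \qquad \text{EQ1}$$

$$P_{and}(Q=\mathrm{true}) = p_1\, p_2\, p_3 \cdots p_n \qquad \text{EQ2}$$

$$P_{not}(Q=\mathrm{true}) = 1 - p_1 \qquad \text{EQ3}$$

$$P_{wsum}(Q=\mathrm{true}) = \frac{(w_1 p_1 + w_2 p_2 + \cdots + w_n p_n)\, w_g}{w_1 + w_2 + \cdots + w_n} \qquad \text{EQ4}$$

where P(P_1=true)=p_1, P(P_2=true)=p_2, . . . P(P_n=true)=p_n,
w_1, w_2, . . . w_n are the term weights for each term P_1, P_2,
. . . P_n, and w_g is the maximum probability that the child node
can achieve, 0≤w_g≤1.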
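
The following is a minimal sketch of how such parent-to-child
combinations might be computed. It assumes the four expressions are the
disjunction, conjunction, negation and weighted-sum forms commonly used
in inference networks; the function names and the default w_g of 1.0 are
illustrative choices, not taken from the specification.

```python
from math import prod

def p_or(ps):
    """Belief that at least one parent is true (assumed disjunctive form)."""
    return 1.0 - prod(1.0 - p for p in ps)

def p_and(ps):
    """Belief that all parents are true (assumed conjunctive form)."""
    return prod(ps)

def p_not(p):
    """Belief that a single parent is false (assumed negation form)."""
    return 1.0 - p

def p_wsum(ps, ws, w_g=1.0):
    """Weighted sum of parent beliefs, scaled by the maximum belief w_g."""
    if not 0.0 <= w_g <= 1.0:
        raise ValueError("w_g must lie between 0 and 1")
    return w_g * sum(w * p for w, p in zip(ws, ps)) / sum(ws)
```

Under these assumptions, two parents each held with belief 0.5 give a
conjunctive belief of 0.25 and a disjunctive belief of 0.75.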

As described above, all child nodes carry a probability
that the child was caused by the identified parent nodes. The
structure of document network 10 is not changed, except to
add documents to the database. The document nodes d and
text nodes t do not change for any given document once the
document representation has been entered into document
network 10. Most representation nodes are created with the
database and are dependent on the document content. Some
representation nodes (representing phrases and the like) are
created for the particular search being conducted and are
dependent on the search query.

Query network 12, on the other hand, changes for each
input query defining a document request. Therefore, the
concept nodes c of the search network are created with each
search query and provide support to the query nodes q and
the information need, node I (FIG. 1).

Document searching can be accomplished by a
document-based scan or a concept-based scan. A document-based scan
is one wherein the text of each document is scanned to
determine the likelihood that the document meets the
information need, I. More particularly, the representation nodes
r_1, r_2, r_3, . . . r_k of a single document are evaluated with
respect to the several query nodes q to determine a
probability that the document meets the information need.
The top D-ranked documents are then selected as potential
information need documents. The scan process reaches a
point, for example after assigning a probability for more
than D documents of a large document collection, that
documents can be eliminated from the evaluation process
after evaluating subsets of the representation nodes. More
particularly, if a given document scores so low of a
probability after only evaluating one or two representation nodes,
determination can be made that even if the evaluation
continued the document still would not score in the top
D-ranked documents. Hence, most documents of a large
collection are discarded from consideration without having
all their representation nodes evaluated.

A concept-based scan is one wherein all documents
containing a given representation node are evaluated. As the
process continues through several representation nodes, a
scorecard is maintained of the probabilities that each
document meets the information need, I. More particularly, a
single representation node r_1 is evaluated for each document
in the collection to assign an initial probability that the
document meets the concept. The process continues through
the several representation nodes with the probabilities being
updated with each iteration. The top D-ranked documents
are then selected as potential information need documents.
If at some point in the process it can be determined that
evaluation of additional representation concepts will not
alter the ranking of the top D-ranked documents, the scan
process can be terminated.

It can be appreciated that the representation nodes r_1, r_2,
r_3, . . . r_k are nodes dependent on the content of the texts of
the documents in the collection. Most representation nodes
are created in the document database. Other representation
nodes, namely those associated with phrases, synonyms and
citations, are not manifest in any static physical embodiment
and are created based on each search query. Because the user
can define phrases and thesaurus relationships when creating
the query, it is not possible to define all combinations in a
static physical embodiment. For example, a query
manifesting the concept "employee" may be represented by one or
more of "actor", "agent", "attendant", "craftsman", "doer",
"laborer", "maid", "servant", "smith", "technician" and
"worker", to name a few. These various representation nodes
may be created from the query node at the time of the search,
such as through the use of thesauri and other tools to be
described, as well as through databases. A query node q
can be manifest in one or more representations.

The Search Query

The present invention will be described in connection
with a database for searching legal documents, but it is to be
understood the concepts of the invention may be applied to
databases for searching other types or classes of documents.
The invention will be described in connection with a specific
search query as follows:

"What is the liability of the United States under the
Federal Tort Claims Act for injuries sustained by
employees of an independent contractor working under
contract with an agency of the United States
government?"

The present invention is carried out through use of a
computer system, such as illustrated in FIG. 3 comprising a
computer 20 connected to an input/output terminal 22 and a
read only memory (ROM) 24. ROM 24 may be any form of
read only memory, such as a CD ROM, write protected
magnetic disc or tape, or a ROM, PROM or EPROM chip
encoded for the purposes described. Computer 20 may be a
personal computer (PC) and may be optionally connected
through modem 26, telephone communication network 28
and modem 30 to a central computer 32 having a memory
34. In one form of the invention, the document network 10
and the document database containing the texts of
documents represented by the document network are contained in
the central computer 32 and its associated memory 34.
Alternatively, the entire network and database may be
resident in the memory of personal computer 20 and ROM 24.
In a legal database and document information retrieval
network the documents may comprise, for example,
decisions and orders of courts and government agencies, rules,
statutes and other documents reflecting legal precedent. By
maintaining the document database and document network
at a central location, legal researchers may input documents
into the document database in a uniform manner. Thus, there
may be a plurality of computers 20, each having individual
ROMs 24 and input/output devices 22, the computers 20
being linked to central computer 32 in a time-sharing mode.
The search query is developed by each individual user or
researcher and input via the respective input/output terminal
22. For example, input/output terminal 22 may comprise the
input keyboard and display unit of PC computer 20 and may
include a printer for printing the display and/or document
texts.

ROM 24 contains a database containing phrases unique to
the specific profession to which the documents being
searched are related. In a legal search and retrieval system
as described herein, the database on ROM 24 contains
stemmed phrases from common legal sources such as
Black's or Statsky's Law Dictionary, as well as common
names for statutes, regulations and government agencies.
ROM 24 may also contain a database of basic and extended
stopwords comprising words of indefinite direction which
may be ignored for purposes of developing the concept
nodes of the search query. For example, basic stopwords
included in the database on ROM 24 include indefinite
articles such as "a", "an", "the", etc. Extended stopwords
include prepositions, such as "of", "under", "above", "for",
"with", etc., indefinite verbs such as "is", "are", "be", etc.
and indefinite adverbs such as "what", "why", "who", etc.
The database on ROM 24 may also include a topic and key
database such as the numerical keys associated with the
well-known West Key Digest system.

FIGS. 4A and 4B are a flow diagram illustrating the
process steps and the operation on the example given above
in the development of the concept nodes c. The natural
language query is provided by input through input terminal
22 to computer 20. In the example shown in FIG. 4, the
natural language input query is:

"What is the liability of the United States under the
Federal Tort Claims Act for injuries sustained by employees
of an independent contractor working under contract with an
agency of the United States government?"

By way of example, a corresponding WESTLAW
Boolean query might be:

"UNITED STATES" U.S. GOVERNMENT (FEDERAL
/2 GOVERNMENT) /P TORT /2 CLAIM /P INJUR! /P
EMPLOYEE WORKER CREWMAN CREWMEMBER /P
INDEPENDENT /2 CONTRACTOR

As shown in FIG. 4A, the natural language query shown
in block 40 is inputted at step 50 to computer 20 via
input/output terminal 22. The individual words of the natural
language query are parsed into a list of words at step 50, and
at step 54 each word is compared to the basic stopwords of
the database in ROM 24. At step 54, the basic stopwords
such as "the" are removed from the list. The extended
stopwords are retained for phrase recognition and remaining
extended stopwords will be removed after phrase
recognition, described below.

At step 56, the remaining words are stemmed to reduce
each word to its correct morphological root. One software
routine for stemming the words is based on that described by
Porter, "An Algorithm for Suffix Stripping", Program, Vol.
14, pp 130-137 (1980). As a result of step 56 a list of words
is developed as shown in block 42, the list comprising the
stems of all words in the query, except the basic stopwords.
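
The parse, basic-stopword and stemming steps can be sketched as
follows. This is only an illustration: the stopword set is a small
hypothetical sample, and `toy_stem` is a crude suffix stripper standing
in for the Porter routine, so its stems will not always match those
shown in block 42.

```python
import re

# Hypothetical sample of the basic stopword list held in ROM 24.
BASIC_STOPWORDS = {"a", "an", "the"}

def toy_stem(word):
    """Crude suffix stripper used as a stand-in for the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def parse_and_stem(query):
    """Split the query into words, drop basic stopwords, stem the rest."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    kept = [w for w in words if w not in BASIC_STOPWORDS]
    # Extended stopwords ("of", "is", ...) are deliberately kept here,
    # because phrase recognition still needs them.
    return [toy_stem(w) for w in kept]

stems = parse_and_stem("What is the liability of the United States")
```

Note that extended stopwords survive this stage; they are removed only
after phrase recognition.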
Phrases 

Previous systems recognized linguistic structure (for
example, phrases) by statistical or syntactic techniques.
Phrases are recognized using statistical techniques based on
the occurrence of phrases in the document collection itself;
thus, proximity, co-occurrence, etc. were used. Phrases are
recognized using syntactic techniques based on the word/
term structure and grammatical rules, rather than
statistically. Thus, the phrase "independent contractor" could be
recognized statistically by the proximity of the two words
and the prior knowledge that the two words often appeared
together in documents. The same term could be recognized
syntactically by noting the adjective form "independent" and
the noun form "contractor" and matching the words using
noun phrase grammatical rules. (Manual selection systems
have also been used wherein the researcher manually
recognizes a phrase during input.)

Previous inference networks employed a two-term logical
AND modeled as the product of the beliefs for the individual
terms. Beliefs (probabilities) lie in the range between 0 and
1, with 0 representing certainty that the proposition is false
and 1 representing certainty that the proposition is true. The
belief assigned to a phrase is ordinarily lower than that
assigned to either component term. However, experiments
reveal that the presence of phrases represents a belief higher
than the belief associated with either component term.
Consequently, separately identifying phrases as independent
representation nodes significantly increases the performance
of the information retrieval system. However, single terms
of an original query are retained because many of the
concepts contained in the original query are not described by
phrases. Experimentation has suggested that eliminating
single terms significantly degrades retrieval performance
even though not all single terms from an original query are
required for effective retrieval.

As previously described, the phrase relationships in the
search query are recognized by domain-knowledge based
techniques (e.g., the phrase database), and by syntactic
relationships. The primary reason to solely select syntactical
and domain-based phrases for purposes of the query network
is to reduce user involvement in identifying phrases for
purposes of creating a query.

An example of a domain-knowledge database is a
database containing phrases from a professional dictionary. This
type of phrase handling is particularly suitable for
professional information retrieval where specialized phrases are
often employed.

At step 58 in FIG. 4B, computer 20 returns to the database
in ROM 24 to determine the presence of phrases within the
parsed and stemmed list 42. The phrase database in ROM 24
comprises professional, domain-specific phrases (such as
from Black's Law Dictionary) which have been stemmed in
accordance with the same procedure for stemming the words
of a search query. Computer 20 compares the first and
second words of list 42 to the database of phrases in ROM
24 to find any phrase having at least those two words as the
first words of a phrase. Thus, comparing the first two terms
"WHAT" and "IS" to the database of phrases (such as
Black's Law Dictionary), no match is found. Thus, as shown
in block 44, "WHAT" is retained for the search query. The
next two words "IS" and "LIABL" are compared to the
database of phrases and no phrase is found. When "UNITE"
and "STATE" are compared to the database, a phrase match
is found. The next word "FEDERAL" is then compared to
the database to determine if it corresponds to the third word
of any phrase commencing with "UNITE STATE". In this
case no phrase is found, so both "UNITE" and "STATE" are
removed from the list 44 and substituted with a phrase
representing the term "UNITE STATE". When the terms
"FEDERAL" and "TORT" are compared to the database a
match is found to phrases in the database. The third and
fourth words "CLAIM" and "ACT" also compare to at least
one phrase commencing with "FEDERAL" and "TORT".
Consequently, each of the terms "FEDERAL", "TORT",
"CLAIM" and "ACT" are substituted with the phrase
"FEDERAL TORT CLAIM ACT". (As explained below, if a word
is found to be included in successive phrases, the common
word would be assigned to the longer phrase, if they have an
unequal number of terms, or to the first phrase of the
succession, if the number of terms in the phrases is equal.)
The process continues to substitute phrases from the
database for sequences of stemmed words from the parsed list
42, thereby deriving the list 44.

The phrase lookup is accomplished one word at a time. 
The current word and next word are concatenated and used 
as a key for the phrase database query. If a record with the 
key is found, the possible phrases stored under this key are 
compared to the next word(s) of the query. As each phrase 
is found, a record of the displacement and length of each 
found phrase is recorded. 

The extended stopwords are included in the phrase match- 
ing technique because the phrases themselves contain such 
stopwords. For example, phrases like "doctrine of equiva- 
lents" and "tenancy at will" contain prepositions which are 
stopwords. 

As indicated above, once successive terms have been
identified as a phrase, the individual terms do not appear in
the query shown at block 44 in FIG. 4B. In rare cases two
phrases might seemingly overlap (i.e., share one or more of
the same words). In such a case, the common word is not
repeated for each phrase, but instead preference in the
overlap is accorded to the longer phrase. For example, if a
natural language search query contained ". . . tenancy at will,
the power of which . . . ", the parsed and stemmed list (with
basic stopwords removed) would appear as: "tenan", "at",
"will", "power", "of", "which". The database could identify
two possible phrases: "tenan at will" and "will power" with
"will" in both phrases. As will be explained below,
preference is accorded to the longest possible phrase, so the
identified phrase will be "tenan at will".
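
A minimal sketch of the lookup and overlap rule follows. The
dictionary layout (a two-word key mapping to candidate phrases) and the
function names are assumptions made for illustration; the specification
describes the behavior, not this data structure.

```python
def find_matches(stems, phrase_db):
    """Record (displacement, length) for every phrase found in the list.

    phrase_db maps a (word, next_word) key to the phrases, stored as
    tuples of stems, that begin with those two words.
    """
    matches = []
    for i in range(len(stems) - 1):
        key = (stems[i], stems[i + 1])
        for phrase in phrase_db.get(key, ()):
            if tuple(stems[i:i + len(phrase)]) == phrase:
                matches.append((i, len(phrase)))
    return matches

def substitute_phrases(stems, matches):
    """Replace matched word runs by phrases, resolving overlaps.

    Longer phrases win; among equal lengths the earlier phrase wins.
    """
    taken = [False] * len(stems)
    chosen = {}
    for start, length in sorted(matches, key=lambda m: (-m[1], m[0])):
        if not any(taken[start:start + length]):
            for k in range(start, start + length):
                taken[k] = True
            chosen[start] = length
    out, i = [], 0
    while i < len(stems):
        if i in chosen:
            out.append(" ".join(stems[i:i + chosen[i]]))
            i += chosen[i]
        else:
            out.append(stems[i])
            i += 1
    return out

stems = ["tenan", "at", "will", "power", "of", "which"]
phrase_db = {
    ("tenan", "at"): [("tenan", "at", "will")],
    ("will", "power"): [("will", "power")],
}
result = substitute_phrases(stems, find_matches(stems, phrase_db))
```

Here both "tenan at will" and "will power" are found, but the shared
word "will" goes to the longer phrase, leaving "power" as a single term.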

With the phrases identified, as at 44, the remaining
extended stopwords ("what", "is", "of", "under", "for",
"by", "with") are removed at step 62, and any duplicate
terms are removed at step 64, to be described in greater
detail below. The result is the final query shown at block 46
in FIG. 4B.
Citations 

Case citations, U.S. Code citations and citations to the
Code of Federal Regulations (CFR) are handled as exact
terms. Other citations, including subsection citations, are
handled syntactically using word-level proximity as single
terms or query nodes comprising numeric tokens. For
example, a citation to Volume 78 Columbia Law Review
page 1587 is encoded as 78 +4 1587 (meaning 78 within four
words of 1587), and the citation to 17 U.S.C. 106A(e)(1) is
encoded as 17 +2 106A(e)(1). To encompass most citations,
it is preferred to encode all citations as within five words.
Hence, the above two citations will be encoded as 78 +5
1587 and 17 +5 106A(e)(1).
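
As an illustration of the proximity encoding, the sketch below builds
the "within five words" form and tests it against a tokenized text. The
function names are hypothetical, and treating the operator as ordered
(the second token must follow the first) is an assumption.

```python
def encode_citation(first, second, window=5):
    """Encode a citation as two tokens joined by a proximity operator."""
    return f"{first} +{window} {second}"

def within(tokens, first, second, window=5):
    """True if `second` appears within `window` words after `first`."""
    starts = [i for i, t in enumerate(tokens) if t == first]
    return any(second in tokens[i + 1:i + 1 + window] for i in starts)

encoded = encode_citation("78", "1587")
hit = within("see 78 colum l rev 1587".split(), "78", "1587")
```

A single window of five is wide enough for both example citations, which
is why one fixed value can replace the per-citation distances.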
Hyphenations 

Hyphenated terms in search queries are handled in much
the same manner as citations. The hyphen is removed and
the component words are searched using an adjacency
operation which finds all adjacent occurrences of the
component words.
Synonyms 

Synonyms comprise equivalent words and misspellings
and are created from a predetermined database stored in ROM
24 (FIG. 3). Examples of equivalencies include 2d/2nd/
second whereas examples of misspellings include habeas/
habeus. Where a search query includes a word having a
synonym, a new representation node r (FIG. 2) is created for
each synonym. However, the weight associated with the
node is based on the frequency of the entire class of nodes
comprising all synonyms, rather than any one term of the
class.

Duplicate Terms

Where a single word, term or phrase occurs more than
once in a query, the word, term or phrase is evaluated only
once. After the word, term or phrase has been processed for
phrase identification as heretofore described, the duplicate
word, term or phrase is simply dropped from the search
query. As will be explained hereinafter, the component
probability score for each document containing a term
duplicated in the query is multiplied by the query frequency,
and the query normalization factor is increased by that
frequency. Thus, the effect is that the duplicated term is
evaluated multiple times as dictated by the query, but in a
computationally simpler manner.
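
The bookkeeping for duplicated terms might look like the sketch below.
Combining the weighted component scores as a normalized sum is an
assumption made for illustration; the names `collapse_duplicates` and
`score` are hypothetical.

```python
from collections import Counter

def collapse_duplicates(terms):
    """Return each distinct term once, paired with its query frequency."""
    counts = Counter(terms)
    unique = list(dict.fromkeys(terms))  # keeps first-occurrence order
    return [(t, counts[t]) for t in unique]

def score(doc_probs, query_terms):
    """Combine component probabilities for one document.

    doc_probs maps a term to its component probability for the document.
    Each component is multiplied by the query frequency, and the
    normalization factor grows by the same frequency.
    """
    total, norm = 0.0, 0
    for term, qf in collapse_duplicates(query_terms):
        total += qf * doc_probs.get(term, 0.0)
        norm += qf
    return total / norm

query = ["unite state", "agenc", "unite state"]
probs = {"unite state": 0.6, "agenc": 0.3}
collapsed = collapse_duplicates(query)
combined = score(probs, query)
```

The result equals what evaluating "unite state" twice would give, but
the term is looked up only once.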
Thesaurus Classes 

Thesauri are employed to identify words of similar or
related meaning, as opposed to synonyms having identical
meaning. The thesauri are used to suggest broader, narrower
and related terms to the researcher for inclusion in the search
query. These relationships can be drawn from the machine
readable dictionaries (such as Black's Law Dictionary)
encoded in databases, or from manually recorded domain
knowledge.
Document Retrieval 

One feature of probabilistic information retrieval systems
is that the documents in the document collection are ranked
in accordance with the probability that the document meets
the information need identified in the query. This permits
selection of a predetermined number of documents having
the highest probabilities for identification and retrieval. For
a given information need, for example, it may be desirable
to retrieve 20 documents from a document collection of
500,000 documents. A probabilistic information retrieval
network can identify for retrieval the 20 documents having
the highest probability of meeting the information need.
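
Selecting the top-ranked documents from a table of probabilities can be
sketched with a bounded heap, which avoids sorting the whole collection.
The function name and the sample probabilities are illustrative only.

```python
import heapq

def top_ranked(doc_probs, d):
    """Return the d (document, probability) pairs with highest probability."""
    return heapq.nlargest(d, doc_probs.items(), key=lambda kv: kv[1])

# Hypothetical probabilities that each document meets the information need.
doc_probs = {"doc1": 0.41, "doc2": 0.87, "doc3": 0.65, "doc4": 0.12}
best = top_ranked(doc_probs, 2)
```

For a collection of 500,000 documents and d of 20, only 20 candidates
need to be held in memory at any time during the selection.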

Phrases, synonyms, proximities and thesaurus classes are
not separately permanently identified in the document
network. Instead, the representation nodes in the document
network are created for the phrase, synonym, proximity or
thesaurus class by those concept nodes (FIG. 1) which
themselves are a function of the phrase or term in the query.

FIGS. 6A-6D illustrate different treatments of phrases in
the document network of an inference network.
Representation concepts r_1 and r_2 shown in FIGS. 6A-6D correspond
to two words in the text of a document d. Representation
concept r_3 corresponds to the phrase in the text consisting of
the two words. Q represents the query. For example, r_1 and
r_2 may correspond to the occurrence of the terms
"independent" and "contractor", respectively, while r_3 corresponds to
the occurrence of the phrase "independent contractor". In the
model illustrated in FIG. 6A (which is the preferred model),
the phrase is treated as a separate representation concept,
independent of the concepts corresponding to the component
words. The belief in the phrase concept can be estimated
using evidence about component words and the relationship
between them, including linguistic relationships. The
presence of the query phrase concept in the document increases
the probability that the document satisfies the query (or
information need). The model of FIG. 6B illustrates the case
where the belief in the phrase concept depends on the belief
in the concepts corresponding to the two component words.
FIG. 6C illustrates a term dependence model where the
phrase is not represented as a separate concept, but as a
dependence between the concepts corresponding to the
component words. A document that contains both words will
more likely satisfy the query associated with the phrase due
to the increased belief coming from the component words
themselves. However, experimentation has revealed that the
model of FIG. 6C is less appropriate for phrases and more
appropriate for thesauri and synonyms. In FIG. 6D belief in
the phrase concept is established from evidence from the
document text itself, whereas belief in the concepts
representing component words are derived from belief in the
phrase itself. The model of FIG. 6D makes explicit the
conditional dependence between the component concepts
and addresses the practice of some authors that all
component words of a phrase might not always be used in the text
representation of a document. For the present purposes, it is
preferred that document network 10 employ the phrase
model of FIG. 6A so that the representation concepts for the
phrases are independent of the corresponding words. Hence,
a match between the concept node of a search query and the
concept node of a documentation representation is more
likely to occur where the search query contains only the
phrase, and not the component words. It is understood that
the other models (FIGS. 6B-6D) could be employed with
varying results.

Thus far, techniques have been described for
obtaining lists containing single words, phrases, proximity terms
(hyphenations and citations) and key numbers. These
elements represent the basic concept nodes contained in the
query. The phrases, hyphenations and citations create
representation nodes of the document network. Computer 20
(FIG. 3) forwards the search query to computer 32, which
determines the probability that a document containing some
subset of these concepts matches the original query. For each
single document, the individual concepts represented by
each single word, phrase, proximity term, and key number
of the query are treated as independent evidence of the
probability that the document meets the information need, I.
The probability for each concept is determined separately
and combined with the other probabilities to form an overall
probability estimate.

The probabilities for individual concepts are based on the
frequency with which a concept occurs in document j (tf_ij)
and the frequency (f_i) with which documents containing the
concept (i) occur in the entire collection. The collection
frequency may also be expressed as an inverse document
frequency (idf_i). The inference network operates on two
basic premises:

A concept that occurs frequently in a document (a large
tf_ij) is more likely to be a good descriptor of that
document's content, and
A concept that occurs infrequently in the collection (a
large idf_i) is more likely to be a good discriminator than
a concept that occurs in many documents.
It can be shown that the probability, P(c_i|d_j), that concept
c_i is a "correct" descriptor for document d_j may be
represented as



log 



EQ5

$$tf_{ij} = 0.5 + 0.5 \cdot \frac{\log f_{ij}}{\log \max f_j}$$

if f_ij is less than max f_j, where n_c is the number of documents
in the collection, f_ij is the frequency of concept i in document
j, f_i is the frequency of documents in the collection
containing term i (i.e., the number of documents in which term i
occurs), and max f_j is the maximum frequency for any term
occurring in document j. If f_ij is not less than max f_j, then
tf_ij is set to 1.
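
A small sketch of the within-document frequency factor follows. It
assumes the factor has the form 0.5 + 0.5 · log f_ij / log max f_j when
f_ij is below the document maximum and is 1 otherwise; the function name
is hypothetical and frequencies are assumed to be at least 1.

```python
import math

def tf_factor(f_ij, max_f_j):
    """Term-frequency factor for concept i in document j (assumed form)."""
    if f_ij < 1 or max_f_j < 1:
        raise ValueError("frequencies must be at least 1")
    if f_ij >= max_f_j:
        return 1.0
    return 0.5 + 0.5 * math.log(f_ij) / math.log(max_f_j)
```

Under this form a concept occurring once scores 0.5, and the score
rises toward 1 as its frequency approaches that of the most frequent
term in the document.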

Most document networks for search and retrieval are
represented by a word index containing words from the
documents to be matched to query terms. In Boolean
networks, relationships were determined from the word index
and offset data therein to locate documents meeting the
logical criteria of the query. The present invention employs
a probabilistic network in which the same database and word
index may be employed to calculate the probabilities set
forth in Equation 5 for many of the query concepts. The
number of documents in the collection, n_c, is known from
the document addresses associated with words in the word
index. To calculate f_i, the number of documents in the
collection containing concept i is determined by locating and
counting the addresses of all documents in the database
containing the concept. More particularly, the document
addresses associated with each word in the word index
corresponding to the concept are compared to remove
duplicate addresses and the remaining number of document
addresses is summed. The resulting sum is f_i. The frequency
or number of times, f_ij, that concept i appears in document
j can be calculated from the number of offset codes for the
word (and its synonyms) associated with the document.
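
The counting described above can be sketched against a toy word index.
The index layout (word, then document address, then a list of offsets)
and the function name are assumptions chosen for illustration.

```python
def concept_frequencies(word_index, concept_words):
    """Return (f_i, per-document counts f_ij) for one concept.

    word_index maps word -> {document address: [offset codes]}.
    concept_words holds the word and any synonyms forming the concept.
    """
    f_ij = {}
    for word in concept_words:
        for doc, offsets in word_index.get(word, {}).items():
            # Each offset code is one occurrence of the word in the document.
            f_ij[doc] = f_ij.get(doc, 0) + len(offsets)
    # Keying by document address removes duplicate addresses, so the
    # number of keys is the number of documents containing the concept.
    return len(f_ij), f_ij

word_index = {
    "habeas": {1: [4, 9], 3: [2]},
    "habeus": {3: [7], 5: [1]},
}
f_i, f_ij = concept_frequencies(word_index, ["habeas", "habeus"])
```

Document 3 contains both spellings, yet it is counted once toward f_i
while both occurrences count toward its f_ij.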

Hence, the terms idf_i and tf_ij can be calculated, thereby
leading to the probability factor, P(c_i|d_j), for the concept for
the document in accordance with Equation 5. However, this
technique is useful only for those concepts whose concept
frequency is represented in the word index. Certain
concepts, such as phrases, are not ordinarily so represented, so
it is an aspect of the present invention to provide a technique
to estimate the representation concept frequency for such
concepts.

Representation Concept Frequency Estimation 

The inverse document frequency (idf_i) is predetermined
for each representation concept in the document collection,
except certain representations such as phrases, synonyms,
proximities and thesaurus classes. For phrases, synonyms,
proximities and thesaurus classes, the inverse document
frequency is computed for each search. Identifying the
inverse document frequency for a given phrase, synonym,
proximity or thesaurus class requires processing through
each document in the collection. In small collections, the
computation of the inverse document frequency of a phrase,
synonym, proximity, or thesaurus class may be performed
without significant difficulty by examination of the word
index to determine f_i, n_c and f_ij as described above. Hence,
the inverse document frequency for the phrase may be
calculated using equation 7. However, in the case of large
collections (of the order of 500,000 documents),
computation of the inverse document frequency for a phrase,
synonym, proximity or thesaurus class representation requires
significant processing, if all documents containing a query
concept are to be examined. Moreover, in many
circumstances the computation may lead to a result which is too
insignificant to affect the ranking.

Consider, for example, a synonym class containing terms 
A and B where term A occurs in 10,000 documents in the 
collection of 500,000 documents and term B occurs in 10 
documents. The frequency of the synonym class lies in the
range of 10,000 to 10,010, a frequency difference of 10
documents in 10,010, or about 0.1%. Consequently, the range
of the inverse document frequency, idf_i, lies between about
0.02000 and 0.02002, which is too small to significantly
affect the result ranking. However, if term A appears in
10,000 documents and term B appears in 4,000 documents, the
frequency is in the range of 10,000 to 14,000, leaving a
28.6% frequency difference and a range of inverse document
frequencies between 0.02000 and 0.02800, which is
significant.
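The range arithmetic in this example can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `synonym_class_range` and `relative_spread` are hypothetical, and the bounds follow the reasoning in the text (the class occurs in at least as many documents as its most common term, and in at most the sum of the term frequencies).

```python
def synonym_class_range(term_freqs, collection_size=None):
    """Return (lowest, highest) possible document frequency of a synonym class.

    Assumption: the class matches any document containing at least one term.
    """
    lowest = max(term_freqs)    # all rarer terms co-occur with the commonest one
    highest = sum(term_freqs)   # no two terms ever share a document
    if collection_size is not None:
        highest = min(highest, collection_size)
    return lowest, highest


def relative_spread(lowest, highest):
    """Width of the range as a fraction of its upper end."""
    return (highest - lowest) / highest


for freqs in ([10_000, 10], [10_000, 4_000]):
    lo, hi = synonym_class_range(freqs, collection_size=500_000)
    print(freqs, lo, hi, f"{relative_spread(lo, hi):.1%}")
```

For the two cases in the text the spread comes out at about 0.1% and 28.6% respectively, which is the distinction the paragraph draws between a negligible and a significant range.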

One aspect of the present invention concerns the estima- 
tion of the inverse document frequency for a selected 
representation, such as a phrase, proximity, synonym or 
thesaurus class. More particularly, the representation fre- 
quency is estimated from a sample of the collection with 
sufficient accuracy, while avoiding extended computational 
resources in the evaluation of the entire collection. A sample 
of a plurality of documents is selected from the collection, 
and the representations in the sample documents are
processed to identify the frequency with which the selected
representation occurs in the sample. Specifically, the
"gaps," or the number of documents (g) occurring between
occurrences of documents containing the selected
representation, are identified, and the sum of the squares
of the gaps (sq) is employed to estimate the correct
representation frequency.
The gaps are identified from the successive addresses of 
documents containing the concept as determined from the 
word index of the document database. The sequence of
observed gaps is employed to estimate the maximum and
minimum bounds (f_max and f_min) of the true frequency within
a preselected error rate. The frequency bounds are employed
to compute the range of the probable inverse document
frequency. When that range becomes sufficiently narrow as
to insignificantly affect the result ranking, the midpoint of
the frequency range is selected as the estimated frequency of
occurrence of the selected representation.

After computing the frequency bounds for the given
sample, if the difference between the bounds is so large that
the selection of the midpoint as the estimated frequency of
occurrence is likely to affect the result ranking, the sample
is enlarged to include additional documents, and the
frequency bounds are again computed. Ordinarily, mean and
variance estimations are computed on the basis that each
sample is independent, but in the present case the samples
may not be independent because samples are taken
sequentially, rather than randomly. To adjust for possible
non-random sampling, the variation for the frequency bounds
is estimated in two ways: first based on random sampling, and
second based on gaps (numbers of documents found
between documents containing the representation). The
probable maximum frequency, f_max, and the probable minimum
frequency, f_min, are computed in accordance with the
following algorithms:
where

n_i is the number of documents (or gaps between documents)
in the sample containing the selected representation,

n_c is the number of documents in the collection,
x_i is the number of documents in the sample,
s_i is the greater of x_i/n_i or sd of the n_i gaps, and
z is the standard critical value for normal distribution for
a preselected reliability,
and where sd is the standard deviation and is represented by
Equation 10, where sq is the sum of the squares of the gaps,
or the sum of the squares of the numbers of documents found
between documents containing the representation.

It is preferred that the reliability of the estimation be 
within 0.95 (i.e., the maximum error rate should not exceed
5%). It can be shown that the standard critical value (z) for 
a normal distribution of the documents of the collection, 
within a 0.95 reliability, is 2.8070. 

There are several constraints on the calculation of f_max
and f_min. First, if f_min is smaller than the a priori
minimum, then f_min is set equal to the a priori minimum, and
if f_max is greater than the a priori maximum, then f_max is
set equal to the a priori maximum. To illustrate the a priori
minimums and maximums, assume a synonym class containing
terms A and B where term A appears in 10,000 documents and
term B appears in 4,000 documents. Terms A and B could appear
in the same or overlapping documents, meaning that term B
could appear in as many as 4,000 documents with term A.
Conversely, term B might appear in documents exclusive of
term A. Consequently, although the actual number of
occurrences of the synonym class is unknown, the synonym
class appears in the range of 10,000 to 14,000 documents.
Hence, an a priori minimum number of occurrences can be
established at 10,000 (the number of occurrences of the most
common term A), and an a priori maximum number of occurrences
can be established at 14,000 (the sum of occurrences of both
terms A and B). Similarly, in the case of a phrase containing
two terms A and B (such as "independent contractor"), if A
appears in 10,000 documents and B appears in 4,000
documents, an a priori maximum exists of 4,000 (the number of
occurrences of the least common term B) because that is the
maximum number of documents in which the two terms could
appear together.

Hence, the a priori maximums and minimums are derived
from the pre-identified frequencies f_i of individual terms
(which form or are part of the concept) in the collection, and
the type of concept (synonym, phrase, thesaurus or
proximity).

Another constraint concerning the calculation of f_min is
that if the calculated f_min is smaller than n_i (the number of
documents in the sample containing the representation), f_min
is set equal to n_i. Likewise, if the calculated f_max is
smaller than zero or is less than n_i, f_max is set equal to
n_i+(n_c-x_i) (the number of documents in the sample
containing the representation plus the number of documents of
the collection yet to be considered).

The number of documents x_i in the sample necessary to
estimate the frequency of the selected representation is
increased until the difference between the inverse document
frequencies of the maximum and minimum bounds is
smaller than some prescribed amount.

While the specific limit of the difference between the
maximum and minimum inverse document frequencies is
heuristic, it has been found that when the range of frequency
values between f_max and f_min is so small that further
refinement would not significantly alter the ranking of the
ultimately selected documents, further computation of an
estimated probable frequency for the selected representation
may be halted. For purposes of the present invention, an
inverse document frequency (idf_i) difference of 0.05 or less,
as an empirically selected stopping point, provides good
results. The estimated inverse document frequency for the
selected representation is thereupon selected at the mean
between the maximum and minimum bounds. If the maximum
and minimum bounds are accurate, they would each be
located at a maximum error of 0.025, which is deemed
acceptable for the present purposes. In practice, the correct
frequency error is usually smaller than 0.025 because the
correct frequency tends to lie in the center of the estimated
range more often than near either the maximum or minimum
bound. Tests have indicated that the average error for the
estimated frequency for the selected representation is about
0.01.

FIGS. 7A and 7B, taken together, comprise a detailed
flowchart illustrating the steps of estimating the frequency of
a selected concept, such as a phrase, synonym, proximity or
thesaurus class. The process illustrated in FIGS. 7A and 7B is
carried out by a computer, which calculates the probable
maximum and minimum frequencies f_max and f_min shown in
Equations 8 and 9 and calculates the estimated inverse
document frequency, idf_i, for the selected concept.



At step 70, the number of documents in the sample (x_i),
the number of documents in the sample containing the
selected representation (n_i), the gap size (g), and the sum of
the squares of the gaps (sq), are each initialized to 0. At step
72, 1 is added to x_i, and at step 74 the increased x_i is
compared to n_c, the number of documents in the entire
collection. If x_i is smaller than n_c, the first document j is
examined at step 76 to determine whether or not concept i
appears in the document. If the concept does not appear in
the document, 1 is added to g at step 78 and the
sequence loops back through point 80 to increment x_i by 1.

The process continues to loop until a document is identified
containing concept i at step 76. By that point, the value of
g has been incremented and is equal to the number of
documents not containing concept i since identifying the
previous document containing concept i. At step 82, n_i is
incremented by 1, and at step 84 g^2 is calculated and is added
to sq at step 86. At step 88 g is reset to 0.

To conserve computing resources, it is preferred that f_max
and f_min not be calculated each time a document is located
containing concept i. Instead, it is preferred that a decision
be made at step 90 which inhibits calculation of f_max and
f_min until a predetermined number of documents containing
the concept have been identified. This has two effects:
first, it conserves computing resources, and second, it
permits use of the actual inverse document frequency (idf_i)
for those concepts not appearing often in the collection. More
particularly, it is preferred that a fixed number of documents,
such as 25, be found containing concept i between each
calculation of f_max and f_min. Thus, at step 90 n_i is divided
by 25 and if the result is a whole number (indicating that n_i
is 25, 50, 75, etc.), then the process continues through steps
92, 94 and 96 to calculate f_max and f_min. On the other hand,
if n_i is not equal to 25, 50, 75, etc., the process loops back
through point 80 to continue to identify concept i in
additional documents.

At step 92, x_i/n_i and sd are calculated, sd being calculated
in accordance with equation 10. At step 94, s_i is set to the
greater of x_i/n_i or sd. At step 96, f_max and f_min are
calculated.

It should be noted that g is the size of the gap or the
number of successive documents not containing the concept
between documents that do contain the concept. Thus, g is
incremented at step 78 for each document not containing the
concept and is reset at step 88 upon finding a document
which does contain the concept. Term sq calculated at step
86 is the sum of the squares of the gaps g.
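The counting loop of steps 70 through 90 can be sketched as follows. This is a minimal illustration, not the flowchart itself: the function name `scan_gaps`, the `contains` iterable of booleans (one per document, in collection order) and the `checkpoint` parameter are assumptions made for the example. The sketch only accumulates x_i, n_i, g and sq and records the points at which the bounds would be recomputed; it does not compute f_max or f_min.

```python
def scan_gaps(contains, checkpoint=25):
    """Accumulate the gap statistics described for steps 70 to 90.

    contains   -- iterable of booleans, True where the document has the concept
    checkpoint -- recompute point: every `checkpoint`-th matching document
    Returns (x, n, sq, checkpoints), where checkpoints holds (x, n, sq)
    snapshots taken whenever n is a whole multiple of `checkpoint`.
    """
    x = n = g = sq = 0                 # step 70: initialise everything to 0
    checkpoints = []
    for has_concept in contains:
        x += 1                         # step 72: one more document sampled
        if not has_concept:
            g += 1                     # step 78: the current gap grows
            continue
        n += 1                         # step 82: one more matching document
        sq += g * g                    # steps 84 and 86: add g squared to sq
        g = 0                          # step 88: reset the gap
        if n % checkpoint == 0:        # step 90: time to recompute the bounds
            checkpoints.append((x, n, sq))
    return x, n, sq, checkpoints


# Small demonstration: gaps of 2, 0 and 1 give sq = 4 + 0 + 1 = 5.
print(scan_gaps([False, False, True, True, False, True], checkpoint=2))
```

Keeping only the running sum of squares means that the standard deviation of the gaps can be derived at each checkpoint without storing the individual gaps.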

After the maximum and minimum estimated bounds, f_max
and f_min, are computed, maximum and minimum inverse
document frequencies for the concept, idf_imax and idf_imin,
are calculated at step 98. At step 100, if idf_imin is within
0.05 of idf_imax, the mean frequency f_mean is computed from
f_max and f_min at step 102, and the estimated inverse
document frequency, idf_i, is computed at step 104 for the
concept. As shown at step 100, if the range between the
maximum and minimum inverse document frequencies is greater
than 0.05, the process loops back to point 80 to expand the
sample and the number of documents until the bounds of the
estimates are within 0.05 at step 100 or until the entire
collection has been examined (x_i=n_c) at step 74.

As indicated above, it is possible that the entire collection
could be examined before determining an estimated inverse
document frequency for the selected concept. This might
occur, for example, where a concept very rarely appears in
the documents. In such a case, at step 74, the computer
determines that the number of documents in the sample (x_i)
is equal to the number of documents in the collection (n_c),
in which case the actual inverse document frequency for the
concept is computed at step 106.

Partial Concepts (Phrases and Proximities)

As shown by Equation 4, the probability is computed for
each concept/document pair, and the probabilities are
summed. The result is normalized by the number of concepts
in the query to determine the overall probability estimate
that the document satisfies the information requirement set
forth in the query.

Phrases are treated in a manner similar to proximity terms,
except that a document which does not contain the full
phrase receives a partial score for a partial phrase. For
example, if a query contains the phrase "FEDERAL TORT
CLAIMS ACT" and a document contains the phrase "tort
claims" but not "Federal Tort Claims Act", the document
will receive a score based on the frequency distribution
associated with "TORT CLAIMS". FIG. 8 is a flow diagram
illustrating the process of handling partial matches. As
shown at step 120, the full phrase is evaluated against the
collection as heretofore described. The inverse document
frequency (idf_i) is determined for the full phrase (step 122),
and if idf_i is greater than a predetermined threshold (e.g.,
0.3) the maximum belief achieved for any subphrase or
single term is selected as the belief for the partial phrase
(step 124). If idf_i is smaller than or equal to the threshold
value (0.3), the preselected default belief (0.4) is assigned
to the documents containing the partial phrase (step 126).

Since the frequency of "TORT CLAIMS" must equal or
exceed that of the longer phrase, the probability estimate for
the partial phrase would generally be lower than that
assigned to documents containing the complete phrase. For
phrases which occur extremely often (for example, where
idf_i is less than 0.3) it is preferred to dispense with the
partial matching strategy, and treat the phrase as a pure
proximity term by assigning the default belief (0.4) to all
documents containing the partial phrase but not the full
phrase (step 126). For phrases which appear less often (where
idf_i is greater than 0.3), the maximum belief achieved by any
single word of the partial phrase is assigned to the belief
for the partial phrase.

As previously explained, duplicate terms are purged from
the search query. However, where duplicate terms appear in
the search query, the component probability score for each
document containing the term is multiplied by the query
frequency. For example, if a document containing a term
which appears twice in a natural language query receives a
component probability of 0.425, the probability score is
multiplied by 2 (to 0.850) for that term. When the
probabilities are summed and normalized as described above,
the normalization factor is increased to reflect the
frequency of the duplicated term (increased by 1 in this
example). Thus, the duplicated term is treated as if it had
been evaluated multiple times as dictated by the query, but
in a computationally simpler manner.

As described above, the probability estimates for each
document/concept pair are summed and the result is
normalized by the number of concepts in the query. For the
example given in FIG. 4 the search query shown in block 46
employs eleven concepts, so the total probability for each
document will be divided by 11 to determine the overall
probability that the given document meets the overall query.
For example, assume for a given document that the eleven
probabilities are:

0.400    0.430    0.466
0.543    0.436    0.433
0.512    0.400    0.481
0.460    0.472

The overall probability is the sum of the individual
probabilities (5.033) divided by the number of concepts (11)
for a total probability of 0.458. This indicates a probability
of 0.458 that the document meets the full query shown in block
40 in FIG. 4. The probability is determined for each document
represented in the database, whereupon they are ranked
in accordance with the value of the probability estimate to
identify the top D documents. The ranking or identification
is provided by computer 32 (FIG. 3) to computer 20 for
display and/or printout at output terminal 22. Additionally,
the document texts may be downloaded from computer 32 to
computer 20 for display and/or printout at output terminal
22.

Probability Thresholds

As previously described, the probabilistic document
retrieval system retrieves a predetermined number (D) of
documents having the highest probability of meeting the
information need set forth in the query. These probabilities
are identified by the normalized sum of the probabilities of
each representation in the document matching the concept in
the query. Significant processor resources are required to
compute these probabilities for each document in a large
document database, for example about 500,000 documents
or more. To reduce processing resources, it is desirable to
limit probability computations to a reasonable number.

One technique to reduce processing resources is to
employ a probability threshold against which the
probabilities of documents are compared to determine whether
or not the probability of a given document meets or exceeds
the threshold. For example, in a document retrieval network
designed to retrieve 10 documents, the probability threshold
may be set equal to the probability of the lowest ranked
document of 10 selected documents. To identify 10 documents
from a database of 500,000 documents, the first 10
documents of the database are listed to a result list (making
the initial ranking of the top 10). A probability threshold is
set equal to the probability of the lowest-ranked document of
the first 10 selected documents. The probability of the 11th
document is computed and compared against the probability
threshold. If the probability of the 11th document exceeds
that of the lowest ranked document of the original 10, the
11th document is entered into the result list of 10 selected
documents and the prior lowest ranked document is
removed. A new probability threshold is set to the
probability of the new lowest ranked document of the 10
selected documents. Hence, the probability threshold is a
"running" threshold, constantly updated and increased in
value as additional documents are identified which exceed
the previous threshold.
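The running threshold just described can be sketched with a small min-heap. This is only an illustration under stated assumptions: the function name `top_d_with_running_threshold` and the `(doc_id, probability)` input format are hypothetical, and the heap is an implementation choice for the example rather than anything the text prescribes.

```python
import heapq


def top_d_with_running_threshold(scored_docs, d=10):
    """Keep the d most probable documents, raising the threshold as we go.

    scored_docs -- iterable of (doc_id, probability) pairs in collection order
    Returns (result_list, threshold) with result_list sorted best first.
    """
    heap = []            # min-heap of (probability, doc_id); heap[0] is the lowest
    threshold = 0.0
    for doc_id, prob in scored_docs:
        if len(heap) < d:
            # The first d documents seed the result list.
            heapq.heappush(heap, (prob, doc_id))
            if len(heap) == d:
                threshold = heap[0][0]
        elif prob > threshold:
            # Replace the lowest ranked document and advance the threshold.
            heapq.heapreplace(heap, (prob, doc_id))
            threshold = heap[0][0]
    return sorted(heap, reverse=True), threshold


docs = [(1, 0.5), (2, 0.4), (3, 0.6), (4, 0.7), (5, 0.3)]
print(top_d_with_running_threshold(docs, d=3))
```

In the small run above the threshold starts at 0.4 once three documents are listed, rises to 0.5 when document 4 displaces document 2, and document 5 is never entered.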

It will be appreciated that at some point in the document
identification process, the threshold becomes so high that
many documents may be discarded from consideration after
consideration of only a few of the representation
probabilities. Assume, for example, a query containing eleven
concepts and a probability threshold of 0.8965 (well into the
document identification process). For a document to meet
the threshold, it must have a minimum sum of individual
probabilities of 9.8615 (11x0.8965). Under such
circumstances, a low representation probability amongst the
first few representations may result in a mathematical
impossibility of meeting the threshold. For example, if the
first two representations of a document have probabilities of
0.311 and 0.400, giving a sum of 0.711, it will not be possible
for that document to make the result list of 10. Even if the
representation probabilities matching the other nine
concepts each had a probability of 1.0, the maximum sum of
probabilities would be 9.711, which is normalized to a
maximum probability of 0.8828, below the probability
threshold. Consequently, it is unnecessary to calculate the
additional representation probabilities for the document or to
further process the document's probabilities.
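The arithmetic of this early cut-off can be checked with a short sketch. The function names `max_attainable` and `can_still_qualify` are hypothetical; the only assumption carried over from the text is that each unscored concept could contribute at most 1.0.

```python
def max_attainable(partial_probs, n_concepts):
    """Best normalized probability still reachable for a document.

    partial_probs -- representation probabilities computed so far
    n_concepts    -- number of concepts in the query
    """
    remaining = n_concepts - len(partial_probs)
    # Every concept not yet scored is assumed to reach the ceiling of 1.0.
    return (sum(partial_probs) + remaining * 1.0) / n_concepts


def can_still_qualify(partial_probs, n_concepts, threshold):
    """True if the document could still meet or exceed the threshold."""
    return max_attainable(partial_probs, n_concepts) >= threshold


best = max_attainable([0.311, 0.400], n_concepts=11)
print(round(best, 4), can_still_qualify([0.311, 0.400], 11, 0.8965))
```

With the figures from the text, the best case is 9.711 divided by 11, or about 0.8828, so scoring of the remaining nine concepts can be skipped.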

It can be appreciated from the foregoing that comparing 
the document's probabilities against the threshold can pro- 
vide a significant savings in processing resources. 

While the foregoing probability thresholds provide
significant savings in processing resources, particularly well
into the search, very little savings is realized at the early
stages of the search. FIG. 9 is a graph illustrating a
threshold setting technique as described above. The process
commences with a probability threshold of zero, following
curve 130. When the predetermined number of documents D are
initially identified, the initial threshold is established as
the lowest probability of the initial 10 documents, and
subsequent documents are compared against the threshold. As
additional documents are processed and the threshold value
increases, it can be appreciated from FIG. 9 that the
threshold value follows curve 130 approaching maximum
threshold level 132. It can be shown that the number of
documents requiring examination against the probability
threshold is high at the early stages of the process and
decreases as the process advances. Hence, the area of the
graph of FIG. 9 above the curve of line 130 is representative
of the number of documents requiring processing and is
representative of the required processing resources.

One feature of the present invention resides in the early
estimation of the probability threshold for documents
meeting the information need of the query. More particularly,
by selecting a sample of documents and setting the initial
probability threshold as equal to the probability of the
document in the sample having the highest probability, an
initial threshold may be established against which further
documents may be compared as previously described. This
"running start" is shown in FIG. 9 as the initial threshold for
the process.

As the search continues through the collection, fewer 
documents have their probabilities scored and the probabil- 
ity threshold increases. Hence, document selection follows 
curve 134 in FIG. 9. The establishment of an initial threshold
as described results in a smaller area above line 134; the
shaded area 136 represents a reduction in processing 
resources required for conducting the search. 

It can be statistically shown that a document retrieval 
system, seeking to retrieve 10 documents meeting an infor- 
mation need defined by a query from a document collection 
of 500,000 documents, will, with a 5% maximum probable 
error rate, find one document in the first 309 documents, two 
documents in the first 11,095 documents, three documents in
the first 25,070 documents, and so on in accordance with the 
following Table I: 

TABLE I

Sequence    Limit (D)

    309          1
 11,095          2
 25,070          3
 48,843          4
 80,269          5
118,159          6
161,889          7
211,278          8
266,579          9
500,000         10



The software algorithm for selecting the sequence of
numbers for Table I is set forth below, where cs is the
collection size (equal to n_c, the number of documents in the
collection), gs is the goal size (equal to D, the number of
documents to be selected or identified) and me is the
maximum error sought. For Table I, cs is 500,000, gs is 10
and me is 0.05.



SOFTWARE ALGORITHM

me = me ÷ ((gs - 1) * 100)
conf = 1.0 - me
p = gs ÷ cs
lowi = (log(conf)) ÷ -p          (natural log)
IF lowi = 0 THEN table(1) = lowi + 1
            ELSE table(1) = lowi
DO (j = 1 to (gs - 2))
    lowi = lowi + 1
    oldhi = cs - 1
    WHILE ((oldhi - lowi) <> 1)
        highi = ((lowi + oldhi - 1) ÷ 2) + 1
        lambda = highi * p
        term = exp(-lambda)
        sum = term
        DO i = 1 TO j
            term = term * (lambda ÷ i)
            sum = sum + term
        ENDDO
        IF sum > conf THEN lowi = highi
                      ELSE oldhi = highi
    END WHILE
    table(j+1) = lowi
ENDDO
table(gs) = cs



The foregoing software algorithm and Table I are
employed to statistically optimize the probable document
distribution in the collection, and identify one document to
the result list during the first iteration, two documents to the
result list during the second iteration, etc. until the final
selection of ten documents is entered to the result list
during the tenth iteration. During each iteration, a new
sample of documents is selected from the collection, each
sample being distinct from every other sample. Thus,
referring to Table I, the first sample comprises documents 1
through 309, the second sample comprises documents 310
through 11095, the third sample comprises documents
11096 through 25070, etc. During the first iteration, the one
document having the highest probability of meeting the
information need defined by the query is selected from
documents 1 through 309. During the second iteration, two
documents having the two highest probabilities are selected
from the group consisting of the sample of documents
(documents 310 through 11095) plus the one document
selected from the previous iteration. During the third
iteration, three documents having the three highest
probabilities are selected from the group consisting of
documents 11096 through 25070 plus the two documents
selected during the second iteration. The process continues
through all iterations (10 in the example) to identify the
predetermined number D of documents (10 in the example).
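The way the Table I sequence partitions the collection into distinct samples can be sketched as follows. The name `sample_ranges` is hypothetical, and the list `TABLE_I` simply copies the sequence values from Table I.

```python
TABLE_I = [309, 11_095, 25_070, 48_843, 80_269,
           118_159, 161_889, 211_278, 266_579, 500_000]


def sample_ranges(limits):
    """Turn the sequence limits into (D, first_doc, last_doc) samples.

    Each sample starts one document after the previous sample ends, so the
    samples are distinct and together cover the whole collection.
    """
    ranges = []
    start = 1
    for d, limit in enumerate(limits, start=1):
        ranges.append((d, start, limit))
        start = limit + 1
    return ranges


for d, first, last in sample_ranges(TABLE_I)[:3]:
    print(f"iteration {d}: documents {first} through {last}, keep top {d}")
```

The first three samples come out as documents 1 through 309, 310 through 11095 and 11096 through 25070, matching the ranges given in the text.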

It is evident from the foregoing that if a given sample,
such as the third sample, has two documents having
probabilities which exceed the lowest of the previously
selected documents, one previously selected document will be
removed from the selection list. The ultimately selected
documents, being ten in number, are not necessarily selected
one from each of the ten samples. Instead, the selected
documents are those ten documents having the highest
probability of meeting the information need defined by the
query, within a given error, such as 5%. While the above
software algorithm sets forth the sample selection technique
for any given number of documents to be identified, the
above Table I sets forth a preferred example in connection
with a document database of 500,000 documents selecting
10 documents most likely to meet the information need.
Clearly, the algorithm may be used to provide the parameters
for databases of other sizes, selection of other numbers of
documents, and tolerance within other maximum error rates.
Moreover, the algorithm may be modified to fit other
examples in other situations, and, in fact, other algorithms
are possible to define the sampling technique.

It may be desirable to employ the probability threshold
technique described above with the statistical optimization
selection described above. Hence, referring to Table I, the
probability threshold may be set from the first sample,
requiring that documents selected during successive
iterations also equal or exceed the probability threshold. As
the processing continues, if the document of the first sample
is ultimately replaced (that is, for a given iteration the
probability of the first sample document is exceeded by the
probabilities of at least the number of documents required by
the iteration), a new threshold is established at the
probability of the new lowest document. Consequently, the
probability threshold level continues to advance as documents
continue to be identified.

FIG. 10 is a flowchart of the steps of the statistical
optimization selection technique of developing the
probability threshold and document distribution optimization
for the present invention.

More particularly, at step 150 the document distribution
table of Table I is initialized to meet the criteria for error,
numbers of documents sought, and collection size in
accordance with the above-described software algorithm. At
step 152, the probability threshold value is initialized to 0
and the number of documents sought to be identified, D_i, is
initialized to one. At step 154, a document from the collection
is scored utilizing the maximum score optimization technique,
explained below in connection with FIG. 11. At the same
time, the number of documents processed since the previous
document was scored is identified. At step 156, a count is
incremented identifying the total number of documents from
the collection which have been processed.

Referring to Table I, if the first thirty documents of the
collection contain no representations matching a concept of
the query, the documents will not be scored because their
probabilities would be 0.4. If the thirty-first document is the
first document of the collection having representations
which meet concepts of the query, that document is located
and scored at step 154 using the maximum score
optimizations described below. At the same time, a count of 31
is entered, representative of the number of documents
processed (x_i). Since the thirty-first document is the only
document in the result list, it is placed at the top of the result
list.

At step 158, the value from the table corresponding to D_i
is compared against the number of documents x_i counted at
step 156. If the number of documents, x_i, is smaller than the
table value corresponding to D_i, the process continues to
step 160. At step 160, each scored document is entered into
the result list stored in the memory of the computer in
descending order of probabilities. Thus, the document with the
highest probability appears at the top of the result list
whereas the document meeting the maximum score
optimizations having the lowest probability is at the bottom of
the list. In the initial iteration, x_i is 31 since thirty-one
documents had been processed, and the value from Table I is
309 (corresponding to D_i=1).

Since the value from the table, 309, is greater than x_i, 31,
the probability threshold is set at step 162 to the score for the
Dth document in the result list, which in the example is the
thirty-first document. At step 164, the number of documents
processed, x_i, is compared to the total number of documents
in the collection, n_c, and if the number of documents
processed is smaller than the number of documents in the
collection, the process loops back through point 166 to
return to step 154. Any further documents which have
probabilities less than the threshold probability (or which
cannot mathematically achieve a probability greater than the
probability threshold after calculation of less than all
representation probabilities) are excluded (or not scored) at
step 154.

Assume document one hundred eighty has a probability
greater than the probability threshold established by
document thirty-one. Hence, document one hundred eighty is
identified at step 154 and inserted into the result list in
probability order, ahead of document thirty-one.
At step 156, x_i is incremented to indicate the count, 180, of
the number of documents thus far processed, which count is
still smaller than 309, the number in Table I associated with
D_i. Consequently, the sequence proceeds to step 160 to insert
document one hundred eighty into the result list. At step 162
the probability threshold is set to the score of the Dth
document in the result list. Since D_i is 1, the probability
threshold is set to the score of document one hundred eighty.

Assume the next document having a probability greater
than the probability threshold set by document one hundred
eighty is document six hundred ten. Document six hundred
ten is found and scored at step 154. At step 156 the count x_i
is incremented to 610, and since the value 309 from Table I
is not greater than 610 at step 158, D_i is incremented by 1
at step 168 so that the new value from Table I to be
considered is 11,095. The process loops back to step 158
where the value 11,095 from Table I is found to be greater
than 610. Hence the process continues to step 160 where
document six hundred ten is inserted in the result list in
probability order. At step 162 a new probability threshold
equal to the score of the Dth document in the result list is to
be set. In this case, however, nothing changes because D_i is
now set to 2, meaning that both documents one hundred eighty
and six hundred ten appear in the result list, and the
probability threshold will continue to be set to the score of
the document of the result list having the lowest probability,
namely document one hundred eighty.

The process continues through the remainder of the 
database, incrementally increasing the value from Table I 
against which the document number is processed at step 158, 
the process continuing until 10 documents are identified and
all documents in the database have been processed. When
this occurs, x_i equals n_c at step 164 and the final result list
is retrieved at step 168. 

It might be advantageous, particularly where small document
collections are to be searched and processing power is
large, to perform the process of FIG. 10 for only a single
iteration to find the document of the first sample having the
highest probability and to set the probability threshold to
the probability of that document for scoring the remainder of
the document collection in the manner described above.
Thus, the probabilities of documents added to the result list
must exceed the initial probability threshold, at least until
the preselected number of documents is added to the result
list. Thereafter, the probability threshold is increased as
additional documents having higher probabilities are added
to the list and documents with the lowest probabilities are
removed from the list.

In any event, if fewer than the preselected number of
documents are ultimately identified to the result list, a new
probability threshold may be established slightly below the
probability of the document on the result list with the lowest 
probability and the entire collection re-scored as described 
above. 
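
The rising threshold described above can be sketched as follows. This is a minimal illustration, not the patented procedure itself: it assumes a fixed list size rather than the per-iteration values taken from Table I, and the function name `update_result_list` and its parameters are hypothetical.

```python
import bisect


def update_result_list(result, doc_id, score, max_size):
    """Insert (score, doc_id) into a list kept in descending score order.

    Returns the probability threshold after the update: 0.0 until the
    list holds max_size documents, then the lowest score on the list.
    """
    threshold = result[-1][0] if len(result) >= max_size else 0.0
    if score > threshold:
        # Keys are negated so bisect can keep the list in descending order.
        keys = [-s for s, _ in result]
        result.insert(bisect.bisect_right(keys, -score), (score, doc_id))
        if len(result) > max_size:
            result.pop()  # drop the document with the lowest probability
    return result[-1][0] if len(result) >= max_size else 0.0
```

Once the list is full, a document enters only by beating the current threshold, and each successful insertion can only raise the threshold or leave it unchanged.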

Maximum Score optimization 

This technique is illustrated in the flow chart of FIG. 11.
More particularly, FIG. 11 illustrates the iterative loops for
scoring documents employed at step 154 in FIG. 10. Each
document in the document database has a document number
associated with it. The maximum score optimization commences
with the concept i_1 in the query having the highest
idf_i. A lower bound document number is chosen (such as the
lowest document number in the database). The first document
in the database whose document number is greater than the lower
bound document number and which contains the concept i_1
is selected as a candidate document.

A remainder score is initialized to the maximum possible
score less the value that a document scores for the concept
i_1 being examined. Thus, the remainder score value represents
the maximum score which each document which does
not contain concept i_1 could achieve without concept i_1. The
process continues by iterating through each of the concepts
i_2, i_3, etc. The concepts are processed in descending order of
concept idf_i value. As noted above, the concept with the
highest idf_i is the concept which appears least frequently in
the collection and is more likely to be a good discriminator
than a concept which appears more often. The processing for
each concept commences with the document having a document
number greater than or equal to the lower bound
document number.

In the processing, three conditions can occur.

1. If the document number for the current concept is equal
to that of the candidate document, the candidate document
contains the concept and no change is made to the maximum
score. Instead, the process continues to the next concept.

2. If the document number for the current concept is
greater than that of the candidate document, the current
document does not contain the concept and the value of the
current concept is subtracted from the maximum score for
the candidate document and the remainder score is adjusted.
If the maximum score is still high enough that the candidate
document might still be selected, the processing will continue
to the next concept. If not, the candidate document is
discarded and the processing starts over with the next higher
document number as the candidate document.

3. If the document number for the current concept is less
than that of the candidate document, a document exists with
a lower number which must be evaluated before continuing
with the candidate document.

The remainder score tabulated for each document repre- 
sents the maximum score that document can achieve based 
on the concepts processed up to that point and the possibility 
that it contains all the subsequent concepts. As each concept 
is processed, the remainder score for the document is 
reduced by the value of the concept for each document in 
which the concept does not appear. In considering the 
remainder score, two possibilities exist.

1. If the remainder score is less than the minimum
document score necessary to remain in the result list, then
that document, and all other documents up to the candidate
document number, can be discarded, since it is not possible
for any of them to achieve a document score high enough to
remain in the result list. In this situation, the next document
number which is greater than or equal to the candidate
document number is selected for the concept and the processing
continues as described above.

2. If the remainder score is not less than the minimum
document score necessary to remain in the result list, then
the document is considered as a candidate for the result list.
In this case, the document score for the document is set to
the current remaining score and the candidate document
number is reset.

The process continues until a candidate is found having a 
maximum possible score greater than the probability threshold
required to remain in the result list.
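
The pruning idea behind the maximum score optimization can be sketched for a single candidate document. This is a simplified, document-at-a-time illustration under stated assumptions: the concepts are supplied already sorted by descending idf, each concept contributes a fixed value when present, and the function name `score_with_pruning` is hypothetical. It does not reproduce the posting-list traversal of FIG. 11.

```python
def score_with_pruning(doc_concepts, ordered_concepts, threshold):
    """Score one candidate, stopping once it cannot beat the threshold.

    doc_concepts: set of concepts the candidate document contains.
    ordered_concepts: list of (concept, value) pairs, assumed to be
        sorted by descending idf.
    Returns the final score, or None if the candidate is discarded.
    """
    # Start from the maximum possible score: every concept present.
    max_score = sum(value for _, value in ordered_concepts)
    for concept, value in ordered_concepts:
        if concept not in doc_concepts:
            # A missing concept lowers the best score still achievable.
            max_score -= value
            if max_score <= threshold:
                return None  # cannot remain in the result list
    return max_score
```

Processing the rarest concepts first tends to discard weak candidates after only one or two lookups, since those concepts carry the largest values.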

The process of the maximum score optimization may be
explained with reference to the flowchart of FIG. 11. At step
180 the lower bound document number, probability threshold
(from step 152 or 162 in FIG. 10) and the maximum
possible score are inputted. For the initial iteration for a
given document, the probability threshold is initialized to 0
at step 152 in FIG. 10 and the maximum possible score is
initialized. The lower bound document number is set to the
first document in the database desired to be reviewed. At
step 182, the first document having a document number
greater than or equal to the lower bound document number
and which contains the concept having the highest idf_i is
identified as a candidate document. Thus, the document
number is identified for the first document containing the
concept. At step 184, the remainder score for all other
documents having a lower number is initialized to be equal
to the maximum possible score less the incremental concept
value from the missing concept i_1 having the highest idf_i. At
step 186, a decision is made as to whether all the concepts
have been processed, and if they have not, the current
concept is set to the concept i_2 whose idf_i is next highest in
value below the first concept i_1, at step 188. At step 190, the
document number is set to the document number of the next
document greater than or equal to the lower bound document
number for the current (second) concept i_2. At step 192, if
the document number of the document containing the concept
is less than the current candidate document number,
then the decision is made at step 194 whether the remainder
score is smaller than the probability threshold initialized at
step 152 or set at step 162 in FIG. 10. If the remainder score
is smaller than the minimum probability threshold, then the
lower bound document number is set to the current candidate
document number and the document number of the next
document containing the concept i_2 currently being processed
is set to the next document number greater than or
equal to the current lower bound document number for the
current concept. The concept incremental value is subtracted
at step 200 from the remainder score. If, at step 194, the
remainder score is greater than or equal to the probability
threshold, then the candidate document number is set, at step
202, to the document number of the next document containing
the concept, and the candidate document score is set,
at step 204, to the remainder score. The process then
continues to step 200 to subtract the concept incremental
value from the remainder score for the documents not
containing the concept.

If at step 192 the document number containing the concept
is greater than or equal to the candidate document
number, then the process continues directly to step 200
where the concept incremental value is subtracted from the
remainder score for the documents not containing the concept.

At step 206, if the document number containing the
concept is equal to the candidate document number, then the
candidate document is found to contain the concept, and the
process returns to step 186 and processes through the loop
again for the next concept. If the document number containing
the concept is not equal to the candidate document
number, then the concept incremental value is subtracted
from the candidate document score at step 208. If the
resulting candidate document score is greater than the probability
threshold, the process loops back through step 186
again. On the other hand, if the candidate document score is
not greater than the probability threshold, the lower bound
document number is set to the candidate document number
plus 1 and the process reloops to step 182.

If a candidate document loops through the process of FIG.
11 through all of the concepts of the query, and the document
score is greater than the probability threshold at step 210,
step 186 identifies that all concepts have been processed and
returns the document at step 214 for insertion into the full
result list in sorted order at step 156 in FIG. 10. The process
terminates for a given threshold value only when a candidate
is found, after all concepts have been examined, which has
a maximum possible score greater than the probability
threshold required to remain in the result list. The process
iterates through the loops illustrated in FIG. 10 until the
required number of documents for the result list is identified.
The documents may then be retrieved from the database using
the result list at step 170, the scoring of each document
occurring through the iterations of the loops of FIG. 11.

It may be desirable to incorporate certain relational constraints
on the placement of documents into the result list. As
one example, it might be desirable to limit the search output
to documents dated after a given date. Suffice it to say that
such a constraint can be imposed on the document retrieval
system in a manner well known in the art.

Document Retrieval

FIGS. 12 and 13 are flowcharts detailing the construction
and evaluation of an inference network, FIG. 12 being a
detailed flowchart for constructing the query network 12 and
FIG. 13 being a detailed flowchart for evaluating the query
network in the context of the document network 10. As
heretofore described, an input query written in natural
language is loaded into the computer, such as into a register
therein, and is parsed (step 220), compared to the stopwords
in database 222 (step 224) and stemmed at step 226. The
result is the list 42 illustrated in FIG. 4. Using synonym
database 228, the list is compared at step 230 to the synonym
database and synonyms are added to the list. As will be
explained hereinafter, the handling of synonyms may actually
occur after handling of the phrases. Citations are located
at step 232 as heretofore described. More particularly, a
proximity relationship is established showing the page number
within five words of the volume number, without regard
to the reporter system employed. The handling of citations,
like the handling of synonyms, may be accomplished after
phrase resolution, if desired.

Employing phrase database 234, a decision is made at step
236 as to whether or not phrases are present in the query. If
phrases are present, a comparison is made at step 240 to
identify phrases. At step 242 a determination is made as to
whether successive phrases share any common term(s) (an
overlap condition). More particularly, and as heretofore
described, terms which are apparently shared between successive
phrases are detected at step 242. At step 244 a
determination is made as to which phrase is the longer of the
two phrases, and the shared term is included in the longer
phrase and excluded from the shorter phrase. As a result of
deleting the shared term from the shorter phrase, the resulting
shorter phrase may not be a phrase at all, in which case
the remaining term(s) are simply handled as stemmed
words. On the other hand, if the two phrases are of equal
length, then the shared term is accorded to the first phrase
and denied to the second phrase.
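
The overlap rule of step 244 can be illustrated with a short sketch. It assumes each phrase is held as a list of stemmed terms; the function name `resolve_overlap` and the example phrases in the usage note are hypothetical.

```python
def resolve_overlap(first, second):
    """Assign terms shared by two successive phrases to one of them.

    The longer phrase keeps the shared terms; on a tie the first
    phrase keeps them. Returns the two phrases after resolution.
    """
    shared = set(first) & set(second)
    if not shared:
        return first, second  # no overlap condition
    if len(second) > len(first):
        # The second phrase is longer, so the first gives up the terms.
        return [t for t in first if t not in shared], second
    # The first phrase is longer, or the lengths are equal.
    return first, [t for t in second if t not in shared]
```

For example, `resolve_overlap(["product", "liability"], ["liability", "insurance", "claim"])` leaves the single term "product" on the left, which would then be handled as an ordinary stemmed word rather than as a phrase.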

After overlap conflict is resolved at step 244, the resulting 
phrase substitution occurs at step 246. The process loops 
back to step 236 to determine if phrases are still present, and 
if they are the process repeats until no further phrases are 
present. At step 238, all duplicate terms are located, mapped,
counted and removed, with a count V representing the
number of duplicate terms removed. Thus, the search query
illustrated at block 46 in FIG. 4 is developed.
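
The duplicate handling at step 238 can be sketched as below. The function name `remove_duplicates` and the returned tuple layout are assumptions made for illustration; the source only states that duplicates are located, mapped, counted and removed.

```python
def remove_duplicates(terms):
    """Remove repeated query terms while recording how often each occurred.

    Returns (unique_terms, occurrences, v), where v is the number of
    duplicate terms removed from the query.
    """
    occurrences = {}
    for term in terms:
        occurrences[term] = occurrences.get(term, 0) + 1
    # Dictionaries keep insertion order, so first-seen order is preserved.
    unique_terms = list(occurrences)
    v = len(terms) - len(unique_terms)
    return unique_terms, occurrences, v
```

The per-term counts are kept so that a later scoring stage can weight each term by how often it appeared in the original query.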

As heretofore described, the handling of synonyms and 
citations may occur after resolution of the phrases, rather 
than before. 

As illustrated in FIG. 13, the resulting search query is
provided to the document network where, at step 250 the
number of terms T is counted, at step 252 i is set to 0 and
at step 254 1 is added to i. Using document database 256,
which also contains the text of the documents, the inverse
document frequency (idf_i) is determined and the probability
estimate (tf_ij) is determined at step 258. As noted above, both
tf_ij and idf_i are calculated from addresses, document numbers
and offset data in the word index of the document database.
The estimated inverse document frequency (idf_i) is also
added to the database by a temporary memory or register.
The component probability is determined at step 260 as
heretofore described and is accumulated with other component
probabilities at step 262. At step 264 a determination is
made as to whether or not i equals T (where T is the number
of terms in the search query). If
all of the terms have not been compared to the database, the 
process is looped, adding 1 to i and repeated for each term 
until i equals T at step 264. As heretofore described, when
terms having duplicates deleted from the input query are
processed at step 258, the probability for such terms is
multiplied by the number of duplicates deleted, thereby
weighting the probability in accordance with the frequency
of the term in the original input query. Consequently, at step
266, it is necessary to divide the accumulated component
probability for the document by V+T (where V is the number
of duplicate terms deleted from the input query) to thereby
normalize the probability. The probability for each document
is stored at step 268 and the process repeated at step
270 for the other documents. At step 272 the documents are
ranked in accordance with the determined probabilities, and
the top ranked documents are printed out or displayed at step
274.
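
The accumulation and normalization of steps 262 through 266 can be sketched as follows. One interpretive assumption is made and should be read as such: each term's component probability is weighted by the number of times the term occurred in the original query (one plus its deleted duplicates), so that the weights sum to V+T. The function name `normalized_probability` is hypothetical.

```python
def normalized_probability(component_probs, occurrences):
    """Combine per-term component probabilities for one document.

    component_probs: dict mapping each unique query term to its
        component probability for the document.
    occurrences: dict mapping each term to the number of times it
        appeared in the original query (assumed weighting).
    """
    t = len(component_probs)                                   # T
    v = sum(occurrences[term] - 1 for term in component_probs)  # V
    accumulated = sum(p * occurrences[term]
                      for term, p in component_probs.items())
    return accumulated / (v + t)
```

With no duplicate terms, V is 0 and the result reduces to the plain average of the T component probabilities.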

As previously described, the scan technique may be a
concept-based scan, rather than the document-based scan
described. Further, as previously described, the scan may be
aborted after less than a complete scan of any given document
if the probabilities result in a determination that the document
will not reach the cutoff for the D top-ranked documents
to be displayed or printed.

While the present invention has been described in connection
with a time-shared computer system shown in FIG.
3 wherein search queries are generated by PC computers or
dumb terminals for transmission to and time-shared processing
by a central computer containing the document
network, it may be desirable in some cases to provide the
document network (with or without the document text
database) to the user for direct use at the PC. In such a case,
the document database would be supplied on the same ROM
24 as the databases used with the search query, or on a
separately supplied ROM for use with computer 20. For
example, in the case of a legal database, updated ROMs
containing the document database could be supplied periodically
on a subscription basis to the user. In any case, the
stopwords, phrases and key numbers would not be changed
often, so it would not be necessary to change the ROM
containing the databases of stopwords, phrases and key
numbers.

Although the present invention has been described with 
reference to preferred embodiments, workers skilled in the
art will recognize that changes may be made in form and
detail without departing from the spirit and scope of the
invention.

What is claimed is: 

1. In a computer system for identifying a predetermined
number of documents of a document collection containing
representations that have high probabilities of matching a
query containing a plurality of concepts, in which the system
has a database containing identifications of documents in the
document collection and defining a plurality of representations
representing the contents of the documents, the
collection comprising a plurality of documents, and query
means for defining the query, apparatus comprising:

sample selection means for iteratively selecting successive
samples of a plurality of documents from the
collection, each sample containing fewer documents
than the entire collection and each successive sample
containing documents different from each previous
sample;

processing means responsive to the sample selection
means for calculating, during each iteration, probabilities
that documents contained in the sample contain
representations that match the query and for identifying
a preselected number of documents having the highest
probabilities, the documents being identified during an
iteration from a group consisting of the respective
sample of documents and the documents identified
during the next previous iteration, the preselected number
being different for each iteration and no greater than
the predetermined number; and

output means outputting the identifications of the predetermined
number of documents identified by the processing
means.

2. The apparatus according to claim 1 further including 
threshold setting means responsive to the processing means 
for setting a probability threshold equal to the probability of 
a first identified document.

3. The apparatus according to claim 2 including deter- 
mining means operable during each respective iteration and 
responsive to the identification of the preselected number of 
documents by the processing means to determine if an 
additional document of the respective sample has a prob- 
ability greater than the probability threshold, the processing 
means being responsive to the determining means identifying
an additional document having a probability greater than
the probability threshold to replace the previously-identified
document having the lowest probability by the additional 
document, and the threshold setting means being responsive 
to the processing means to reset the probability threshold to 
the probability of the identified document having the new 
lowest probability. 

4. The apparatus according to claim 1 wherein the pre- 
selected number is equal to the number of the respective 
iteration, and the predetermined number is equal to the 
number of the last iteration. 

5. The apparatus according to claim 1 including estimat- 
ing means responsive to the processing means for estimating 
a maximum probability for a second document different 
from the first document based on a partially calculated 
probability for the second document and an assumption that 
the representations in the second document match the con- 
cepts of the query for which probabilities have not been 
calculated, the processing means being responsive to the 
estimating means to calculate partial probabilities that rep- 
resentations in the second document match concepts of the 
query until either the estimated maximum probability does 
not at least equal the threshold or the probability is calcu- 
lated for all the concepts in the query. 

6. The apparatus according to claim 5 wherein the output 
means includes a result list ranking the identified documents 
in probability order, the threshold setting means being 
responsive to the result list to reset the probability threshold 
equal to the probability of the document lowest on the result 
list.

7. The apparatus according to claim 1 wherein the pro- 
cessing means includes a result list ranking the identified 
documents in probability order. 

8. A system for identifying documents matching a query,
comprising:

a memory containing a database containing identification 
of documents in a document collection and defining a 
plurality of representations representing the contents of 
the documents, the collection comprising a plurality of 
documents, the database further containing indications 
of the frequencies of occurrence of documents contain- 
ing first representations in the collection; 

computer means responsive to a query defining a plurality 
of concepts, the computer means including 

matching means for matching the concepts to represen- 
tations, 

estimating means for estimating the frequency of occur- 
rence of documents containing a second selected rep- 
resentation in the collection, the second selected rep- 
resentation being different from any of the first 
representations, the estimating means including 
sample selection means for selecting a sample comprising
a plurality of documents from the collection,
the sample containing fewer documents than the
entire collection;

frequency identifying means responsive to the sample
selection means for identifying the frequency of
occurrence of documents containing the second
selected representation in the selected sample of
documents;

processor means responsive to the memory means and
to the frequency identifying means for calculating a
maximum and a minimum probable frequency of
occurrence of documents containing the second
selected representation in the collection; and

selection means responsive to the processor means for
selecting the midpoint of the maximum and minimum
probable frequencies as the estimated frequency
of occurrence of the second selected representation;

retrieval means for selecting documents meeting the 
query based on the frequencies of occurrence of docu- 
ments containing first representations which match the 
concepts and the estimated frequencies of occurrence 
of documents containing second representations which 
match the concepts, and 

output means responsive to the retrieval means and the
memory for outputting identifications of the selected
documents.

9. The system according to claim 8 wherein the processor
means includes means for identifying if the difference
between the maximum and minimum probable frequencies
is within a preselected limit, and further including adjusting
means responsive to the processor means for adding additional
documents from the collection to the sample of
documents if the calculated difference between the maximum
and minimum probable frequencies exceeds the preselected
limit.

10. The system according to claim 8 where the processor
means calculates the maximum probable frequency, f_max,
and the minimum probable frequency, f_min, in accordance
with relationships based on the number of gaps between
documents in the sample containing the second selected
representation (n_i), the number of documents in the collection
(n_c), and the number of documents in the sample (x_i).
11. The system according to claim 10 where f_max and f_min
are calculated in accordance with the relationships

$$f_{max} = n_i + \frac{n_i (n_c - x_i)}{x_i - z s_i}$$

and

$$f_{min} = n_i + \frac{n_i (n_c - x_i)}{x_i + z s_i}$$

where

s_i is the greater of x_i/n_i or the standard deviation of the n_i
gaps, and

z is the standard critical value for normal distribution for
a preselected reliability.

12. In a system for identifying documents matching a
query, in which the system has a database containing
identifications of documents in a document collection and defining
a plurality of representations representing the contents of
the documents, the collection comprising a plurality of
documents and the database containing a frequency of
occurrence of documents containing each of at least some of
the representations in the collection of documents, query
means for defining a query containing a plurality of concepts,
matching means for matching concepts to representations,
means for selecting documents meeting the query
based on frequencies of occurrence of documents in the
collection containing representations matching the concepts,
and output means responsive to the means for selecting
documents for outputting identifications of selected documents,
the improvement comprising a process of estimating
the frequency of occurrence of documents containing a
representation in the collection of documents for which the
database does not contain a frequency of occurrence,
comprising:

identifying, on the basis of concepts in the query, a
representation for which the database does not contain
a frequency of occurrence;

selecting a sample comprising a plurality of documents
from the collection, the sample containing fewer documents
than the entire collection;

identifying the frequency of occurrence of documents
containing the identified representation in the selected
sample of documents;

calculating a maximum and a minimum probable frequency
of occurrence of documents containing the
identified representation in the collection; and

selecting a midpoint of the maximum and minimum
probable frequencies as the estimated frequency of
occurrence of documents containing the identified
representation,

whereby the means for selecting documents meeting the
query is responsive to the frequencies of occurrence in the
database of documents in the collection containing representations
matching the concepts and to estimated frequencies
of occurrence to select documents in the collection
containing representations matching the concepts.

13. The process according to claim 12 further including
identifying whether the difference between the maximum
and minimum probable frequencies is within a preselected
limit, and adding additional documents to the sample from
the collection if the calculated difference between the maximum
and minimum probable frequencies exceeds the preselected
limit.

14. The process according to claim 13 where the preselected
limit is 0.05.

15. The process according to claim 12 where the maximum
probable frequency, f_max, and the minimum probable
frequency, f_min, are calculated in accordance with the
relationships

$$f_{max} = n_i + \frac{n_i (n_c - x_i)}{x_i - z s_i}$$

and

$$f_{min} = n_i + \frac{n_i (n_c - x_i)}{x_i + z s_i}$$

where

n_i is the number of gaps between documents in the sample
containing the selected representation,

n_c is the number of documents in the collection,

x_i is the number of documents in the sample,

s_i is the greater of x_i/n_i or the standard deviation of the n_i
gaps, and

z is the standard critical value for normal distribution for
a preselected reliability.

16. The process according to claim 15 where the selected
representation contains a plurality of terms, the method
including setting f_min equal to n_i if the calculated f_min is
smaller than n_i, setting f_max equal to n_i(n_c/x_i) if the
calculated f_max is smaller than zero or smaller than n_i, and
setting f_max equal to an a priori maximum if the calculated
f_max is greater than the a priori maximum.

17. The process according to claim 16 wherein the 
selected representation is a synonym represented by a plu- 
rality of terms, and wherein the a priori maximum is equal 
to the sum of all frequencies of occurrence of documents in 
the collection containing a term of the synonym, said 
method including setting f_min equal to an a priori minimum
if the calculated f_min is smaller than the a priori minimum,
where the a priori minimum is equal to the frequency of 
occurrence of documents containing the term of the syn- 
onym appearing in the greatest number of documents in the 
collection. 

18. The process according to claim 16 wherein the 
selected representation is a phrase containing a plurality of 
terms, and the a priori maximum is equal to the frequency 
of occurrence of documents containing the term of the 
phrase appearing in the least number of documents in the 
collection. 

19. The process according to claim 15 where the prese- 
lected reliability is 0.995 and z is 2.8070. 

20. The process according to claim 12 wherein the mid- 
point selected between the maximum and minimum prob- 
able frequencies is the mean of the maximum and minimum 
probable frequencies. 
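
Claims 15, 16 and 20 can be read together as a small numeric procedure. The sketch below is one such reading, assuming the relationships f_max = n_i + n_i(n_c - x_i)/(x_i - z s_i) and f_min = n_i + n_i(n_c - x_i)/(x_i + z s_i). Those forms, the function name `estimate_frequency`, and the use of the population standard deviation are assumptions made for illustration and should be checked against the issued patent.

```python
import statistics


def estimate_frequency(gaps, n_c, x_i, z, a_priori_max=None):
    """Return (f_min, f_max, estimate) for one representation.

    gaps: gaps between sample documents containing the representation.
    n_c: documents in the collection; x_i: documents in the sample.
    z: critical value for the chosen reliability.
    """
    n_i = len(gaps)
    # s_i is the greater of x_i/n_i or the standard deviation of the gaps
    # (population standard deviation is an assumption).
    s_i = max(x_i / n_i, statistics.pstdev(gaps))
    fallback = n_i * n_c / x_i  # n_i(n_c/x_i), the substitute in claim 16
    rest = n_i * (n_c - x_i)

    # f_min is never allowed to fall below n_i.
    f_min = max(n_i + rest / (x_i + z * s_i), n_i)

    low = x_i - z * s_i
    f_max = n_i + rest / low if low > 0 else fallback
    if f_max < 0 or f_max < n_i:
        f_max = fallback
    if a_priori_max is not None and f_max > a_priori_max:
        f_max = a_priori_max

    # Claim 20: the midpoint is the mean of the two bounds.
    return f_min, f_max, (f_min + f_max) / 2
```

If the spread between the two bounds exceeds the preselected limit, the sample would be enlarged and the calculation repeated, as claim 13 describes.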

21. In a computer system for identifying documents 
matching a query, in which the system has a database 
containing identifications of documents in a document collection
and defining a plurality of representations representing
the contents of the documents, the collection comprising
a plurality of documents, and query means for defining a
query containing a plurality of concepts, apparatus for 
identifying documents of the document collection contain- 
ing representations that match the query containing a plu- 
rality of concepts, the apparatus comprising: 

processing means for calculating probabilities that documents
match the query and for identifying a first
document having a calculated probability;

threshold setting means responsive to the processing 
means for setting a probability threshold equal to the 
probability of the first document; 

estimating means responsive to the processing means for
estimating a maximum probability for a second document
different from the first document based on a
partially calculated probability and an assumption that
the representations in the second document match the 
concepts of the query for which probabilities have not 
been calculated; 

the processing means being responsive to the estimating 
means to calculate partial probabilities that represen- 
tations in the second document match concepts of the 
query until either the estimated maximum probability 
for the second document does not at least equal the 
probability threshold or the probability is calculated for 
all the concepts in the query; 

the estimating means being further responsive to the 
processing means ceasing or completing the calculation 
of the probability for the second document to estimate 
a maximum probability for a third document different 
from the first and second documents; and 

output means responsive to the processing means for
outputting identifications of only documents whose
probability is calculated for all concepts in the query.

22. The apparatus according to claim 21 wherein the
output means includes a result list identifying in probability 
order, up to a predetermined number of documents whose 
probability is calculated for all concepts in the query, the 
threshold setting means being responsive to the result list to 
reset the probability threshold equal to the probability of the 
document lowest on the result list. 

23. Apparatus according to claim 21 wherein the thresh- 
old setting means is responsive to the processing means 
calculating the probability for the second document for all 
the concepts in the query to set the probability threshold 
equal to the probability of the second document.

24. The apparatus according to claim 21 wherein the 
output means includes a result list identifying in probability 
order, up to a predetermined number of documents whose 
probability is calculated for all concepts in the query. 

25. A document identification system for identifying a 
predetermined number of documents matching a query, 
comprising: 

a read-only memory containing a database containing 
identifications of documents in a document collection 
and defining a plurality of representations representing 
the contents of documents in the document collection, 
the collection comprising a plurality of documents; 

query means for defining the query containing a plurality 
of concepts; 

computer means responsive to the query containing a 
plurality of concepts, the computer means including 
matching means for matching the concepts to repre- 
sentations; 

sample selection means for iteratively selecting succes- 
sive samples of a plurality of documents from the 
collection for examination, each sample containing 
fewer documents than the entire collection, and each 
successive sample containing documents different from 
each previous sample; 

processing means responsive to the sample selection 
means for calculating, during each iteration, probabilities
that documents contained in the sample contain
representations that match the query and for identifying 
up to a preselected number of documents having the 
highest probabilities, the documents being identified 
during each iteration from a group consisting of the 
respective sample of documents and the documents 
identified during the next previous iteration, the preselected
number being different for each iteration and no
greater than the predetermined number; and

output means outputting identifications of the predeter- 
mined number of documents identified by the process- 
ing means. 

26. The system according to claim 25 further including 
threshold setting means responsive to the processing means 
for setting a probability threshold equal to the probability of 
a first identified document.

27. The system according to claim 26 including deter- 
mining means operable during each respective iteration and 
responsive to the identification of the preselected number of 
documents by the processing means to determine if an 
additional document of the respective sample has a probability
greater than the probability threshold, the processing
means being responsive to the determining means identifying
an additional document having a probability greater than
the probability threshold to replace the previously-identified
document having the lowest probability with the additional
document, and the threshold setting means is responsive to
the processing means to reset the probability threshold to the
probability of the identified document having the new lowest
probability. 

28. The system according to claim 25 including estimat- 
ing means responsive to the processing means for estimating 
a maximum probability for a second document different 
from the first document based on a partially calculated 
probability for the second document and an assumption that 
the representations in the second document match the con- 
cepts of the query for which probabilities have not been 
calculated, the processing means being responsive to the 
estimating means to calculate partial probabilities that rep- 
resentations in the second document match concepts of the 
query until either the estimated maximum probability for the 
second document does not at least equal the threshold or the 
probability is calculated for all the concepts in the query. 

29. The system according to claim 28 wherein the output 
means includes a result list ranking the identified documents 
in probability order, the threshold setting means being 
responsive to the result list to reset the probability threshold 
equal to the probability of the document lowest on the result 
list.

30. The system according to claim 25 wherein the output 
means includes a result list ranking the identified documents 
in probability order. 

31. A document identification system for identifying 
documents matching a query, comprising: 

a read-only memory containing a database containing 
identifications of documents in a document collection 
and defining a plurality of representations representing 
the contents of documents in a document collection, the 
collection comprising a plurality of documents, the 
database further containing indications of the frequen- 
cies of occurrences of a plurality of representations in 
the documents; 

query means for defining the query containing a plurality 
of concepts; 

computer means responsive to the query, the computer 
means including matching means for matching the 
concepts to representations; 

calculating means for calculating the probabilities that
documents meet the query based on the frequencies of 
occurrence of representations in the respective docu- 
ments which match the concepts; 

processing means responsive to the calculating means for 
identifying a first document contained in the sample
having the highest calculated probability; 

threshold setting means responsive to the processing
means for setting a probability threshold equal to the
probability of the first document;

estimating means responsive to the calculating means for
estimating a maximum probability for a second docu- 
ment different from the first document based on a 
partially calculated probability for the second docu- 
ment and an assumption that the representations in the 
second document match the concepts of the query for 
which probabilities have not been calculated,

said calculating means being responsive to the estimating
means to calculate partial probabilities that represen- 
tations in the second document match concepts in the 
query until either the estimated maximum probability 
for the second document does not at least equal the 
probability threshold or the probability is calculated for 
all concepts in the query,

the estimating means being further responsive to the
calculating means ceasing or completing the calculation
of the probability for the second document to
estimate a maximum probability for a third document
different from the first and second documents; and

output means responsive to the processing means for
outputting identifications of only documents whose 
probability is calculated for all concepts in the query. 

32. The document identification system according to 
claim 31 wherein said output means includes a result list 
responsive to the calculating means to identify in probability 
order up to a predetermined number of those documents 
whose probability is calculated for all concepts in the query, 
said threshold setting means being responsive to the result 
list to reset the probability threshold equal to the probability 
of the document lowest on the result list.

33. In a computer system for identifying documents 
matching a query, in which the system has a database 
containing identifications of documents in a document col- 
lection and defining a plurality of representations represent- 
ing the contents of the documents, the collection comprising 
a plurality of documents, and query means for defining a 
query containing a plurality of concepts, a process of iden- 
tifying a predetermined number of documents of the docu- 
ment collection containing representations that have high 
probabilities of matching the query containing a plurality of 
concepts, the process comprising: 

iteratively selecting successive samples of a plurality of 
documents from the collection for examination, each 
sample containing fewer documents than the entire 
collection, and each successive sample containing 
documents different from each previous sample; 

calculating the probabilities that documents contained in 
the sample contain representations that match the 
query; 

identifying, during each iteration, a preselected number of 
documents having the highest probabilities, the docu- 
ments being selected from a group consisting of a 
respective sample of documents and the documents 
identified during the next previous iteration, the prese- 
lected number being different for each iteration and no 
greater than the predetermined number; and 

outputting identifications of the predetermined number of 
identified documents upon completion of the last itera- 
tion. 

34. The process according to claim 33 including setting a 
probability threshold to the probability of the identified 
document having the lowest probability of all identified 
documents, and during each respective iteration and after the 
preselected number of documents has been identified, determining
if an additional document of the respective sample
has been identified having a probability greater than the 
probability threshold, and if so, replacing the previously- 
identified document having the lowest probability with the 
additional document and resetting the probability threshold 
to the probability of the identified document having the new 
lowest probability.

35. The process according to claim 33 wherein the pre- 
selected number is equal to the number of the respective 
iteration, and the predetermined number is equal to the 
number of the last iteration. 
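
As an illustrative sketch only, and not a statement of the claimed implementation, the iterative selection of claims 33 and 35 might be expressed in Python as follows. The function name `iterative_top_documents`, the `probability` callback, and the representation of each sample as a list of document identifiers are assumptions introduced here for illustration; the claims do not prescribe them.

```python
import heapq
from typing import Callable, Iterable, List, Sequence, Tuple


def iterative_top_documents(
    samples: Iterable[Sequence[str]],
    probability: Callable[[str], float],
) -> List[Tuple[float, str]]:
    """Keep up to i documents after iteration i (iterations numbered from 1).

    Hypothetical helper: `probability` stands in for whatever means
    computes the probability that a document matches the query.
    """
    kept: List[Tuple[float, str]] = []
    for iteration, sample in enumerate(samples, start=1):
        scored = [(probability(doc), doc) for doc in sample]
        # Candidates are the current sample plus the documents
        # identified during the previous iteration.
        kept = heapq.nlargest(iteration, kept + scored)
    return kept


# Example with made-up probabilities for six documents in three samples.
example_probs = {"a": 0.2, "b": 0.9, "c": 0.5, "d": 0.1, "e": 0.95, "f": 0.3}
example_result = iterative_top_documents(
    [["a", "b"], ["c", "d"], ["e", "f"]], example_probs.__getitem__
)
```

With three samples the function performs three iterations, retaining one, then two, then three documents, so the predetermined number in this example is three.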

36. The process according to claim 33 including setting a
probability threshold equal to the probability of a first 
document, estimating a maximum probability for a second 
document different from the first document based on a 
partially calculated probability for the second document and 
an assumption that the representations in the second docu- 
ment match the concepts of the query for which probabilities 
have not been calculated, and calculating partial probabilities
that representations in the second document match
concepts in the query until either the estimated maximum 
probability for the second document does not at least equal 
the threshold or the probability is calculated for all the 
concepts in the query.

37. The process according to claim 36 including ranking 
the identified documents in probability order, and resetting 
the probability threshold equal to the probability of the 
document lowest on the list.

38. The process according to claim 33 including ranking 
the identified documents in probability order. 

39. In a computer system for identifying documents 
matching a query, in which the system has a database 
containing identifications of documents in a document
collection and defining a plurality of representations representing
the contents of the documents, the collection comprising
a plurality of documents, and query means for defining a 
query containing a plurality of concepts, a process of iden- 
tifying documents of the document collection containing 
representations that match the query containing a plurality of 
concepts, the process comprising:

computing the full probability that a first document 
matches the concepts in the query; 

setting a probability threshold equal to the full probability 
of the first document;

calculating a partial probability that a second document 
matches some but not all concepts in the query; 

estimating a maximum probability for the second docu- 
ment based on the calculated probability and an 
assumption that the representations in the document
match the concepts of the query for which probabilities 
have not been calculated; 

repeating the steps of calculating and estimating for 
additional query concepts until either the estimated 
maximum probability for the second document is not
as large as the probability threshold or the full prob- 
ability of the second document is calculated for all 
concepts in the query; 

repeating the repetitive steps of calculating and estimating
for a third document different from the first and second 
documents; and 

outputting identifications of only documents having a full 
probability at least as great as the probability threshold.

40. The process according to claim 39 wherein a predetermined
number of documents of the document collection
is identified and wherein documents whose probabilities are
calculated for all concepts in the query are identified to a 
result list in probability order, up to said predetermined 
number, said process further including resetting the probability
threshold equal to the probability of the document
lowest on the result list. 
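
The following Python sketch illustrates one way the partial-calculation process of claims 39 and 40 could operate. It assumes, purely for illustration, that a document's probability is the mean of per-concept belief values between 0 and 1, so that the estimated maximum is obtained by treating every concept not yet evaluated as a full match (belief 1.0). The claims do not specify this combining rule, and the names `rank_with_pruning`, `beliefs`, and `max_results` are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def rank_with_pruning(
    beliefs: Dict[str, Sequence[float]],
    max_results: int,
) -> List[Tuple[float, str]]:
    """Rank documents, abandoning any whose best case cannot reach the threshold.

    beliefs[doc][i] is an assumed probability that `doc` matches concept i.
    """
    results: List[Tuple[float, str]] = []  # fully scored documents, best first
    threshold: Optional[float] = None      # set once the first document is scored
    for doc, per_concept in beliefs.items():
        n = len(per_concept)
        partial = 0.0
        pruned = False
        for done, belief in enumerate(per_concept, start=1):
            partial += belief / n
            # Upper bound: assume all concepts not yet evaluated match fully.
            estimated_max = partial + (n - done) / n
            if threshold is not None and estimated_max < threshold:
                pruned = True  # cease calculation for this document
                break
        if pruned:
            continue
        # Only documents scored on every concept reach the result list.
        results.append((partial, doc))
        results.sort(reverse=True)
        del results[max_results:]
        # Reset the threshold to the lowest probability on the result list.
        threshold = results[-1][0]
    return results


example_beliefs = {
    "d1": [0.5, 0.5, 0.5, 0.5],
    "d2": [0.0, 0.0, 0.0, 1.0],  # abandoned after the third concept
    "d3": [1.0, 1.0, 0.5, 1.0],
    "d4": [1.0, 0.5, 1.0, 0.5],
    "d5": [0.5, 0.5, 0.5, 1.0],  # abandoned after the third concept
}
example_ranking = rank_with_pruning(example_beliefs, max_results=2)
```

In this sketch the probability of the first document acts as the initial threshold, and a document is abandoned as soon as its estimated maximum falls below the current threshold, so its remaining concepts are never evaluated.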

41. In a system identifying a predetermined number of 
documents matching a query, in which the system has a 
database containing identifications of documents in a document
collection and defining a plurality of representations
representing the contents of the documents, the collection 
comprising a plurality of documents, query means for defining
a query containing a plurality of concepts, means for determining
a probability that a document meets the query based on
matches of representations in the document and concepts in
the query, and output means for outputting the identifications
of documents having a probability at least as great as a
probability threshold, apparatus for establishing the probability
threshold comprising:

sample selection means for iteratively selecting successive
samples of a plurality of documents from the
collection for examination, each sample containing
fewer documents than the entire collection and each 
successive sample containing documents different from 
each previous sample; 

calculating means for calculating probabilities that docu- 
ments contained in the sample contain representations 
that match the query; 

processing means responsive to the sample selection 
means to identify, during each iteration, up to a prese- 
lected number of documents having the highest prob- 
abilities, the documents being identified during each 
iteration from a group consisting of a respective sample 
of documents and the documents identified during the
previous iteration; and 

threshold setting means responsive to the processing 
means for setting the probability threshold to the prob- 
ability of the identified document having the lowest 
probability. 

42. The apparatus according to claim 41 including deter- 
mining means operable during each respective iteration and 
responsive to the identification of the preselected number of 
documents by the processing means to determine if the 
processing means identifies an additional document of the 
respective sample having a probability greater than the 
probability threshold, the processing means being responsive
to the determining means to replace the previously-identified
document having the lowest probability by the
additional document, and the threshold setting means is 
responsive to the processing means to reset the probability 
threshold to the probability of the identified document 
having the new lowest probability. 
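
A minimal sketch of the replacement step of claim 42 follows, assuming the identified documents are held in a min-heap so that the lowest probability, which serves as the threshold, is always at the root. The heap, the function name `update_identified`, and the `(probability, identifier)` tuples are illustrative assumptions and are not part of the claims.

```python
import heapq
from typing import Iterable, List, Optional, Tuple


def update_identified(
    identified: List[Tuple[float, str]],
    sample: Iterable[Tuple[float, str]],
    preselected: int,
) -> Optional[float]:
    """Update the min-heap `identified` in place and return the new threshold.

    Returns None when no document has been identified yet.
    """
    for probability, doc in sample:
        if len(identified) < preselected:
            # Still filling up to the preselected number of documents.
            heapq.heappush(identified, (probability, doc))
        elif identified and probability > identified[0][0]:
            # The additional document beats the threshold: replace the
            # previously identified document having the lowest probability.
            heapq.heapreplace(identified, (probability, doc))
    # The threshold is the probability of the new lowest identified document.
    return identified[0][0] if identified else None


example_identified: List[Tuple[float, str]] = []
example_threshold = update_identified(
    example_identified,
    [(0.4, "a"), (0.7, "b"), (0.2, "c"), (0.9, "d")],
    preselected=2,
)
```

A heap is one convenient choice here because reading the threshold and replacing the lowest document are both cheap; a sorted list would serve equally well for small result sets.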

43. The apparatus according to claim 41 wherein the 
preselected number is equal to the number of the respective 
iteration. 

44. In a system for identifying a predetermined number of 
documents matching a query, in which the system has a 
database containing identifications of documents in a docu- 
ment collection and defining a plurality of representations 
representing the contents of the documents, the collection 
comprising a plurality of documents, query means for defin- 
ing a query containing a plurality of concepts, means for 
determining a probability that a document meets the query 
based on a match of representations in the document and 
concepts in the query, and output means for outputting the 
identifications of documents having a probability at least as 
great as a probability threshold, a process for establishing 
the probability threshold comprising: 

iteratively selecting successive samples of a plurality of 
documents from the collection for examination, each 
sample containing fewer documents than the entire 
collection, and each successive sample containing 
documents different from each previous sample; 

calculating probabilities that documents in the sample 
contain representations that match the query; 

identifying, during each iteration, up to a preselected 
number of documents having the highest probabilities, 
the documents being identified during each iteration 
from a group consisting of a respective sample of 
documents and the documents identified during the 
next previous iteration; and 

setting the probability threshold to the probability of the 
identified document having the lowest probability. 

45. The process according to claim 44 including during 
each respective iteration and after the preselected number of 
documents has been identified, determining if an additional 
document of the sample has been identified having a probability
greater than the probability threshold, replacing the
previously-identified document having the lowest probability
by the additional document, and resetting the probability
threshold to the probability of the identified document
having the new lowest probability.

46. The process according to claim 44 wherein the preselected
number is equal to the number of the respective
iteration.

* * * * *